Slack outage May 2025: database routing issue caused downtime
Slack experienced a global service outage triggered by a database routing misconfiguration, impacting users worldwide. Explore a detailed breakdown of what went wrong, the root causes behind the disruption, Slack's official response and recovery actions, and key lessons learned for improving reliability and resilience.
Company
Slack is a cloud-based collaboration platform widely used by teams and enterprises for real-time messaging, file sharing, and workflow automation. Its core services include channels, direct messages, integrations, and enterprise-grade capabilities such as SSO and compliance controls.
What happened during the Slack outage?
Slack experienced a widespread outage lasting 1 hour and 58 minutes. The disruption affected a subset of users worldwide, who were unable to send messages, load channels, or access core features such as threads, canvases, and activity logs.
The root cause was a breakdown in communication between Slack’s web application and its database routing layer. Recent infrastructure growth had outpaced static configurations, preventing routing updates from reaching the web layer. As a result, clients were unable to access live gateway data, leading to elevated error rates and degraded functionality across the platform.
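Slack has not published the internals of its routing stack, so the sketch below is only a loose illustration of the failure mode described above: a web tier that loads its list of database gateways once at startup silently goes stale as the fleet grows, while one that periodically re-reads the list from a discovery source keeps up with routing changes. All names here (fetch_gateways, RefreshingRouter, REFRESH_INTERVAL) are hypothetical.

```python
import random
import threading
import time

REFRESH_INTERVAL = 30  # seconds between refreshes; an illustrative value


def fetch_gateways():
    """Stand-in for a service-discovery lookup (config service, DNS, etc.).

    Here it returns a hard-coded list; in practice it would query whatever
    source of truth knows the current database-gateway fleet.
    """
    return ["db-gateway-1:3306", "db-gateway-2:3306", "db-gateway-3:3306"]


class StaticRouter:
    """Loads the gateway list once at startup and never looks again.

    If gateways are added or replaced later, this copy silently goes stale --
    the failure mode described above.
    """

    def __init__(self):
        self.gateways = fetch_gateways()

    def pick(self):
        return random.choice(self.gateways)


class RefreshingRouter:
    """Re-reads the gateway list on a background thread so routing changes
    reach the web tier without a restart or code change."""

    def __init__(self):
        self.gateways = fetch_gateways()
        self._lock = threading.Lock()
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self):
        while True:
            time.sleep(REFRESH_INTERVAL)
            fresh = fetch_gateways()
            with self._lock:
                self.gateways = fresh

    def pick(self):
        with self._lock:
            return random.choice(self.gateways)


if __name__ == "__main__":
    print("static view:    ", StaticRouter().pick())
    print("refreshing view:", RefreshingRouter().pick())
```

A pull-based refresh is only one option; push-based configuration updates or DNS-backed discovery serve the same goal of keeping the web tier's view of the gateway fleet from drifting out of date.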
Timeline
When did the Slack incident start?
The incident began at 15:00 PDT on May 12, 2025, when Slack first observed elevated error rates. Internal monitoring detected a spike in failed message sends, broken threads, and channel load failures.
How was the Slack incident detected and escalated?
Slack’s engineering teams were alerted as real-time communication began to fail across web and desktop clients. Initial updates were posted on the Slack Status page, but the scope and cause were still under investigation.
When was the Slack outage resolved?
By 16:00 PDT, engineers had identified the misconfigured database routing layer as the source. Infrastructure adjustments were rolled out to restore the web app’s ability to locate database gateways.
By 16:58 PDT, users began regaining access to Slack features. Backend queues were drained, and performance stabilised shortly after.
Full resolution was confirmed on May 13 at 2:52 PM (GMT+2).
MTTD (time to detect): ~10 minutes
MTTR (time to resolve): ~1 hour 58 minutes
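As a quick sanity check, the MTTR figure follows directly from the timestamps above (elevated errors first seen at 15:00 PDT, users regaining access by 16:58 PDT):

```python
from datetime import datetime

incident_start = datetime(2025, 5, 12, 15, 0)     # elevated error rates first seen (PDT)
service_restored = datetime(2025, 5, 12, 16, 58)  # users began regaining access (PDT)

print(service_restored - incident_start)  # 1:58:00 -- the ~1 hour 58 minute MTTR above
```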

Who was affected by the Slack outage, and how bad was it?
The Slack outage on May 12 affected a significant portion of global users across all regions, with varying degrees of service disruption.
Impacted features:
- Messaging
- Channel and thread access
- Slack Activity logs
- Slack Canvas
- App launches and integrations
How did Slack communicate during the outage?
Slack’s engineering teams shared regular updates throughout the incident, beginning shortly after the first signs of service degradation. Messaging was clear and acknowledged the impact across core features, with progress communicated as mitigation steps were rolled out. While the cadence of updates was steady, early messaging lacked specificity around scope and affected users, and recovery timelines shifted without consistent clarification.
What patterns did the Slack outage reveal?
The outage revealed recurring risks in scaled infrastructure systems (a rough monitoring sketch follows the list):
- Infrastructure scale silently exceeding static configuration limits.
- Missing feedback loops between the routing layer and the web application.
- Limited visibility into internal service discovery health.
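The public post-incident notes do not describe Slack's monitoring in detail, so the following is only a rough sketch of how the third gap might be closed: alert when the routing view a web node is actually using drifts from, or is older than, what service discovery reports. All names (check_routing_config, MAX_CONFIG_AGE_SECONDS) are hypothetical.

```python
import time

MAX_CONFIG_AGE_SECONDS = 300  # illustrative freshness budget for a node's routing config


def check_routing_config(loaded_at, loaded_gateways, discovered_gateways):
    """Return a list of problems with the routing view a web node is using.

    loaded_at           -- epoch timestamp when the node last refreshed its config
    loaded_gateways     -- set of gateway addresses the node currently routes to
    discovered_gateways -- set of gateway addresses service discovery reports
    """
    problems = []

    age = time.time() - loaded_at
    if age > MAX_CONFIG_AGE_SECONDS:
        problems.append(f"routing config is {age:.0f}s old (budget {MAX_CONFIG_AGE_SECONDS}s)")

    missing = discovered_gateways - loaded_gateways
    if missing:
        problems.append(f"node is unaware of {len(missing)} gateway(s): {sorted(missing)}")

    retired = loaded_gateways - discovered_gateways
    if retired:
        problems.append(f"node still routes to {len(retired)} retired gateway(s): {sorted(retired)}")

    return problems


if __name__ == "__main__":
    for issue in check_routing_config(
        loaded_at=time.time() - 900,
        loaded_gateways={"db-gateway-1:3306", "db-gateway-2:3306"},
        discovered_gateways={"db-gateway-1:3306", "db-gateway-2:3306", "db-gateway-3:3306"},
    ):
        print("ALERT:", issue)
```

Feeding a check like this into an existing alerting pipeline turns a silently stale configuration into a page long before users see elevated error rates.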
Quick summary
On May 12, 2025, Slack experienced a nearly two-hour global outage due to a misconfigured database routing layer. Messaging, channels, and app functionality broke down for a significant portion of users. Slack responded quickly with infrastructure fixes and steady status updates, though early messaging lacked detail on scope. The outage highlighted risks tied to silent configuration limits, visibility gaps, and routing-layer health.