Intercom: Empty database routing map triggers total US service outage
This article examines the January 2026 Intercom outage in the US region, where a database routing failure caused a total service blackout. We explore how a logic bug in a sharded database layer can disconnect an entire application from its data.
On January 9, 2026, Intercom experienced a critical outage in its US-hosted region, lasting approximately 71 minutes. The incident resulted in total service unavailability, preventing customers from accessing the Intercom Inbox, Messenger, and APIs. The root cause was a software bug within the database routing layer (Vitess/PlanetScale), which erroneously applied an empty routing configuration (VSchema), effectively "orphaning" the application from its data shards.
Company and product
Intercom is a leading AI-first customer service platform that provides businesses with tools for chat, support automation, and customer engagement. Serving over 25,000 customers, Intercom processes massive volumes of real-time data to power instant messaging and support workflows.
To handle this scale, Intercom utilizes PlanetScale, a database platform built on Vitess, an open-source database clustering system for horizontal scaling of MySQL. This architecture allows Intercom to shard data across many database nodes, using a "VSchema" (routing map) to tell the application exactly where specific customer data is located.
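To make the routing map's role concrete, here is a heavily simplified sketch of how a VSchema-style map lets the proxy layer decide which shard owns a row. It is illustrative only, not Intercom's configuration or the real Vitess format; the keyspace, table, column, and shard names are hypothetical.

```python
import hashlib

# Hypothetical, heavily simplified VSchema-style routing map:
# each sharded table declares the column used as its sharding key.
VSCHEMA = {
    "conversations_keyspace": {
        "sharded": True,
        "tables": {"conversations": {"sharding_key": "workspace_id"}},
    },
}

# Hypothetical shard names; real Vitess uses keyspace-ID ranges such as "-80", "80-".
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def route(keyspace: str, table: str, key_value: str) -> str:
    """Pick the shard that owns a row, based on the routing map."""
    ks = VSCHEMA.get(keyspace)
    if not ks or table not in ks["tables"]:
        # This is the situation an empty VSchema creates: the proxy has
        # no entry for the table, so it cannot route the query at all.
        raise LookupError(f"no routing entry for {keyspace}.{table}")
    digest = hashlib.sha256(key_value.encode()).digest()
    return SHARDS[digest[0] % len(SHARDS)]


print(route("conversations_keyspace", "conversations", "workspace-42"))
```

The key point: the proxy can only route a query if the table has an entry in the map, so an empty map leaves every query with nowhere to go.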
What happened
The outage was triggered during a standard administrative operation to move tables for testing. Under normal circumstances, this is a safe, routine task. However, a logic regression in a recently deployed version of Vitess (v22) fundamentally changed how the system handled failed operations.
When the table-move command failed due to transient networking issues, the Vitess rollback logic was triggered. For sharded keyspaces, the system erroneously decided it needed to reapply a "previous" routing map that it hadn't actually saved. Consequently, it applied an empty VSchema to the production environment. Without a valid map, the database proxy layer (VTGates) could no longer route queries to the correct shards. The application remained online but could not perform any database operations, leading to a total failure of all Intercom services in the US.
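The failure pattern is easier to see in code. The sketch below is not the actual Vitess v22 code; it only illustrates, with invented names, how a rollback path that "restores" a previous state it never captured ends up writing an empty one.

```python
class TopoStub:
    """Stand-in for a topology service; holds the routing map the proxies read."""

    def __init__(self, vschema: dict):
        self.vschema = vschema

    def apply_vschema(self, new_vschema: dict) -> None:
        self.vschema = new_vschema


def move_tables(topo: TopoStub) -> None:
    saved_vschema = None  # BUG: the pre-move map is never captured for sharded keyspaces

    try:
        # ... the table move itself; assume it fails part-way through on a
        # transient network error while talking to the topology service ...
        raise ConnectionError("transient network failure")
    except ConnectionError:
        # Rollback tries to "restore" the previous map. Since nothing was saved,
        # it applies an empty one and the proxies lose every routing entry.
        topo.apply_vschema(saved_vschema or {})


topo = TopoStub({"conversations": {"sharding_key": "workspace_id"}})
move_tables(topo)
print(topo.vschema)  # {} -> queries can no longer be routed
```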
Timeline
- When did the incident start? January 9, 2026, 19:39 UTC.
- How was the incident detected and escalated? At 19:40 UTC, "Inbox Heartbeat" anomaly alarms triggered, alerting the on-call engineers and an incident commander within 60 seconds of impact.
- When was the incident resolved? 20:50 UTC, after PlanetScale engineers manually reapplied a valid VSchema snapshot.
TTD (Time to Detect): 1 minute.
TTR (Time to Resolve): 71 minutes.
Who was affected?
All customers and end-users hosted in Intercom’s US region were affected. This included the inability to send or receive messages, load help centers, or use the Intercom API. Customers in other regions (e.g., Europe/Australia) remained unaffected.
How did Intercom respond?
Intercom’s engineering team immediately engaged PlanetScale to investigate the data layer. Once they identified that the routing map had been wiped, PlanetScale engineers restored a VSchema snapshot taken one hour before the incident. As soon as the valid configuration propagated to the proxy fleet, connectivity was restored. Following the recovery, Intercom rolled back all database infrastructure to the previous stable version (v21) to prevent a recurrence.
How did Intercom communicate?
Intercom maintained active communication via the company's status page. Within 24 hours of the incident, they published a highly transparent technical write-up detailing the specific Vitess Pull Request that caused the regression and their plan for hardening topology resilience.
Key learnings for other teams
- Validate rollback logic: In complex sharded systems, a "safe" rollback can be more dangerous than the original failure. Ensure that "undo" operations are tested as rigorously as "do" operations.
- Schema integrity checks: Implement safeguards to prevent the application of "empty" or radically different routing maps to production. A "minimum valid size" check for a VSchema, as sketched after this list, could have prevented this propagation.
- Explicit upgrade triggers: Intercom found that a separate maintenance action (branch unfreezing) unintentionally triggered a pending version upgrade. Infrastructure pipelines should require explicit confirmation for major version changes.
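One possible shape for such a safeguard, sketched with hypothetical names and thresholds (this is not a Vitess feature), is a pre-apply check that rejects an empty or drastically shrunken routing map unless an operator explicitly overrides it.

```python
def check_vschema_change(current: dict, proposed: dict, *,
                         min_tables: int = 1, max_shrink_ratio: float = 0.5) -> None:
    """Raise if the proposed routing map looks like an accidental wipe.

    Hypothetical guard; `min_tables` and `max_shrink_ratio` are illustrative thresholds.
    """
    old_size, new_size = len(current), len(proposed)
    if new_size < min_tables:
        raise ValueError(f"proposed VSchema routes only {new_size} tables; refusing to apply")
    if old_size and new_size < old_size * max_shrink_ratio:
        raise ValueError(
            f"proposed VSchema drops from {old_size} to {new_size} tables; "
            "require an explicit override for a change this large"
        )


current = {"conversations": {}, "admins": {}, "messages": {}}
try:
    check_vschema_change(current, {})  # an empty map never reaches production
except ValueError as err:
    print(f"blocked: {err}")
```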
Summary
A logic bug in Vitess v22 caused Intercom's database routing map to be overwritten with an empty configuration during a failed routine task. This led to a 71-minute total US outage. Service was restored by manually reapplying a VSchema snapshot.
How ilert can help
Infrastructure failures at the database layer require rapid detection and coordinated response. ilert helps teams manage such SEV-1 events by:
- Heartbeat monitoring: Use heartbeat checks to ensure core services (such as Intercom's "Inbox Heartbeat") are actively processing data; see the sketch after this list.
- On-call escalation: Automatically page the right database and SRE teams via Voice, SMS, or chat when regional anomalies are detected.
- Incident postmortems: Use ilert’s postmortem features to document these technical deep-dives and share key learnings with your organization.
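As a rough illustration of the heartbeat pattern (generic, not ilert-specific API usage; the URL and interval are placeholders), a worker pings its heartbeat endpoint only after completing a unit of work, so a stalled pipeline stops pinging and the monitor raises an alert.

```python
import time
import urllib.request

# Placeholder heartbeat URL; replace with the endpoint your monitoring tool provides.
HEARTBEAT_URL = "https://monitoring.example.com/heartbeat/inbox-pipeline"


def process_one_batch() -> None:
    """Stand-in for real work, e.g. delivering queued inbox messages."""
    time.sleep(1)


def main() -> None:
    while True:
        process_one_batch()
        # Ping only after the work succeeded: if processing stalls, the pings
        # stop arriving and the heartbeat monitor alerts the on-call engineer.
        urllib.request.urlopen(HEARTBEAT_URL, timeout=5)


if __name__ == "__main__":
    main()
```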
