Postmortem Library

Clerk: How a hidden cloud database migration caused a full authentication outage

This article examines the March 2026 Clerk outage, where a failed Cloud SQL migration triggered cascading system failures. It breaks down how infrastructure-level latency led to lock contention and authentication failures, and what engineering teams can learn about resilience and dependency on managed cloud services.

Company and product

Clerk is an authentication and user management platform designed for modern web applications. It provides APIs and SDKs for handling user sessions, login flows, and identity management. Because authentication sits at the entry point of most applications, any disruption in Clerk’s services directly affects user access, making it a critical dependency in production environments. The platform relies heavily on managed cloud infrastructure, including hosted PostgreSQL databases, to ensure scalability and availability. This dependency, while operationally efficient, introduces risk when upstream providers experience failures.

What happened

The incident began with a sudden increase in latency and 5xx errors across Clerk’s APIs. Internal monitoring quickly revealed elevated database lock contention. However, the database itself appeared healthy in terms of CPU usage, creating a misleading signal:

  • Queries were not failing outright
  • But they were taking significantly longer to complete

This latency buildup caused a cascading failure:

  • Slow queries increased lock contention
  • Lock contention delayed transactions
  • Delayed transactions saturated compute resources
  • Saturation caused incoming requests to fail

As compute capacity was exhausted, the system began returning 429 (Too Many Requests) errors, which were propagated directly from an internal service instead of being transformed into 500 errors. An attempt to offload reads to database replicas did not alleviate the issue, indicating that the problem was not query volume but underlying storage latency. Eventually, Clerk activated Origin Outage Mode, a resilience mechanism designed to preserve session validation. While this restored core session functionality, the system remained degraded until the root issue was resolved.
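The 429 propagation issue is worth dwelling on: a 429 from an internal, saturated service signals a server-side problem, not client misbehavior, so passing it through verbatim misleads callers. A minimal sketch of the translation an edge layer could perform (function names and the 503 mapping are illustrative assumptions, not Clerk's actual gateway code):

```python
# Sketch: translating an internal service's 429 into a 5xx at the edge.
# This is a hypothetical illustration, not Clerk's implementation.

def edge_status(upstream_status: int, upstream_is_internal: bool) -> int:
    """Map an upstream response code to the code the client should see.

    A 429 from an *internal* service means "we are saturated", which is a
    server-side failure, not a signal that the caller sent too many
    requests. Surfacing it unchanged makes clients back off for
    rate-limit reasons that do not apply to them.
    """
    if upstream_status == 429 and upstream_is_internal:
        return 503  # Service Unavailable: our problem, not the caller's
    return upstream_status

assert edge_status(429, upstream_is_internal=True) == 503
assert edge_status(429, upstream_is_internal=False) == 429  # genuine rate limit
assert edge_status(200, upstream_is_internal=True) == 200
```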

Root cause

The root cause was traced to a failed live migration of a Cloud SQL virtual machine performed by Google Cloud. Live migrations are typically designed to be transparent and zero-downtime. However, in this case:

  • The migration introduced significant disk latency
  • This caused read delays without increased write activity
  • The latency triggered database lock contention, which cascaded into a full service outage

Critically, these migrations:

  • Are not visible to or scheduled by Clerk
  • Provide no advance notice
  • Cannot be safely failed over during execution

Replica failover is unsafe while a live migration is in progress, so teams cannot rely on traditional failover strategies during the operation. This makes live migrations a high-risk, opaque failure mode for systems built on managed databases, and it significantly increases dependency risk on the underlying provider.

Timeline

  • 15:57 UTC: Alerts triggered for elevated latency and 5xx errors; incident declared
  • 16:10 UTC: Origin Outage Mode enabled; session validation restored
  • 16:23 UTC: Incident resolved automatically as the migration completed
  • Post-incident: Root cause confirmed by Google Cloud

Time to Detect (TTD): ~1 minute
Time to Resolve (TTR): ~26 minutes

Who was affected?

All applications relying on Clerk for authentication during the incident window experienced degraded or failed login and session requests. The impact was particularly severe for:

  • Applications with high authentication request volumes
  • Systems without fallback authentication mechanisms
  • Workloads dependent on real-time session validation

How did Clerk respond?

Clerk responded quickly by:

  • Detecting the issue via monitoring and alerting systems
  • Attempting mitigation through read-replica routing
  • Activating Origin Outage Mode to preserve session functionality
  • Escalating the issue to Google Cloud

Following the incident, Clerk initiated several long-term remediation steps:

  • Requesting database pinning to avoid future live migrations
  • Planning to replace live migrations with controlled replica promotion
  • Evaluating alternative database strategies, including self-hosted options

Additionally, Clerk shifted a significant portion of its engineering resources toward reliability-focused initiatives, prioritizing system stability over feature development.

How did Clerk communicate?

Clerk communicated transparently through a detailed public postmortem.

The communication included:

  • A clear timeline of events
  • Honest acknowledgment of responsibility
  • Technical root cause explanation
  • Concrete remediation plans

Notably, Clerk chose to explicitly name their upstream provider, emphasizing accountability and clarity over abstraction.

Key learnings for other teams

  • Opaque infrastructure risks. Managed services can introduce hidden operations (such as live migrations) that are outside your control but can still impact system availability.
  • Latency is as dangerous as downtime. Systems may appear operational while increased latency silently degrades performance and triggers cascading failures.
  • Cascading failure patterns. A chain like database latency → lock contention → compute saturation → request failure is a critical failure pattern that must be monitored and mitigated.
  • Failover limitations in managed systems. Replica failover may be unsafe or unavailable during certain provider operations (such as live migrations), removing a key recovery mechanism during incidents.
  • Resilience mechanisms must scale under stress. Fallback modes and safeguards must remain functional even when compute resources are constrained.
  • Avoid single-provider dependency risks. Critical infrastructure should include strategies for redundancy, failover, or partial independence from a single cloud provider.
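One concrete pattern behind several of these learnings is a fallback that degrades gracefully when the primary dependency slows down or fails, rather than failing closed. A minimal sketch, loosely modeled on an "Origin Outage Mode"-style cached session validation (all class and parameter names are hypothetical, not Clerk's API):

```python
import time

# Sketch of degraded-mode session validation: if the primary store fails,
# serve recently validated sessions from a local cache instead of rejecting
# every request. Hypothetical illustration; names are invented.

class SessionValidator:
    def __init__(self, primary, cache_ttl_s: float = 300.0):
        self.primary = primary       # callable: session_id -> bool (may raise)
        self.cache = {}              # session_id -> (verdict, validated_at)
        self.cache_ttl_s = cache_ttl_s

    def validate(self, session_id: str) -> bool:
        try:
            valid = self.primary(session_id)
            self.cache[session_id] = (valid, time.monotonic())
            return valid
        except Exception:
            # Degraded mode: trust a recent cached verdict rather than
            # returning an error for every request.
            entry = self.cache.get(session_id)
            if entry is not None:
                valid, at = entry
                if time.monotonic() - at < self.cache_ttl_s:
                    return valid
            raise  # no recent verdict: fail rather than guess
```

Note that the fallback path does no database work at all, which is exactly the property "resilience mechanisms must scale under stress" demands: the safety net must not depend on the resource that is already saturated.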

Quick Summary

Clerk experienced a SEV-1 outage caused by a failed Google Cloud SQL live migration. The resulting disk latency triggered lock contention and compute saturation, leading to widespread authentication failures. The incident highlights the risks of opaque cloud operations and the limitations of failover during provider-controlled events.

How ilert can help

Incidents like Clerk’s outage highlight how quickly infrastructure-level failures can cascade while remaining difficult to diagnose. While ilert cannot prevent upstream provider failures, it significantly reduces detection time, investigation friction, and response coordination complexity.

  • Advanced alerting: Correlate latency spikes, error rates, and saturation signals to surface early indicators of cascading failures, helping teams identify systemic issues faster.
  • On-call management: Automatically route SEV-1 incidents to the right engineers (backend, infrastructure, database specialists), ensuring immediate and structured response.
  • ChatOps coordination: Centralize investigation steps, mitigation attempts, and hypotheses in real time, avoiding fragmented communication during high-pressure incidents.
  • Status pages: Communicate authentication outages and degraded performance clearly to end users, reducing uncertainty and support load during incidents.
