Postmortem Library

Clerk: How a hidden cloud database migration caused a full authentication outage

This article examines the March 2026 Clerk outage, where a failed Cloud SQL migration triggered cascading system failures. It breaks down how infrastructure-level latency led to lock contention and authentication failures, and what engineering teams can learn about resilience and dependency on managed cloud services.

Company and product

Clerk is an authentication and user management platform designed for modern web applications. It provides APIs and SDKs for handling user sessions, login flows, and identity management. Because authentication sits at the entry point of most applications, any disruption in Clerk’s services directly affects user access, making it a critical dependency in production environments. The platform relies heavily on managed cloud infrastructure, including hosted PostgreSQL databases, to ensure scalability and availability. This dependency, while operationally efficient, introduces risk when upstream providers experience failures.

What happened

The incident began with a sudden increase in latency and 5xx errors across Clerk’s APIs. Internal monitoring quickly revealed elevated database lock contention. However, the database itself appeared healthy in terms of CPU usage, creating a misleading signal:

  • Queries were not failing outright
  • But they were taking significantly longer to complete

This latency buildup caused a cascading failure:

  • Slow queries increased lock contention
  • Lock contention delayed transactions
  • Delayed transactions saturated compute resources
  • Saturation caused incoming requests to fail

As compute capacity was exhausted, the system began returning 429 (Too Many Requests) errors, which were propagated directly from an internal service instead of being transformed into 500 errors. An attempt to offload reads to database replicas did not alleviate the issue, indicating that the problem was not query volume but underlying storage latency. Eventually, Clerk activated Origin Outage Mode, a resilience mechanism designed to preserve session validation. While this restored core session functionality, the system remained degraded until the root issue was resolved.
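The 429 propagation issue is worth dwelling on: a 429 from an internal, saturated service signals a server-side problem, not client misbehavior, so passing it through verbatim misleads callers. A minimal sketch of the translation an edge layer could perform (function names and the 503 mapping are illustrative assumptions, not Clerk's actual gateway code):

```python
# Sketch: translating an internal service's 429 into a 5xx at the edge.
# This is a hypothetical illustration, not Clerk's implementation.

def edge_status(upstream_status: int, upstream_is_internal: bool) -> int:
    """Map an upstream response code to the code the client should see.

    A 429 from an *internal* service means "we are saturated", which is a
    server-side failure, not a signal that the caller sent too many
    requests. Surfacing it unchanged makes clients back off for
    rate-limit reasons that do not apply to them.
    """
    if upstream_status == 429 and upstream_is_internal:
        return 503  # Service Unavailable: our problem, not the caller's
    return upstream_status

assert edge_status(429, upstream_is_internal=True) == 503
assert edge_status(429, upstream_is_internal=False) == 429  # genuine rate limit
assert edge_status(200, upstream_is_internal=True) == 200
```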

Root cause

The root cause was traced to a failed live migration of a Cloud SQL virtual machine performed by Google Cloud. Live migrations are typically designed to be transparent and zero-downtime. However, in this case:

  • The migration introduced significant disk latency
  • This caused read delays without increased write activity
  • The latency triggered database lock contention, which cascaded into a full service outage

Critically, these migrations:

  • Are not visible to or scheduled by Clerk
  • Provide no advance notice
  • Cannot be safely failed over during execution

Replica failover is unsafe while a live migration is in progress, so teams cannot rely on traditional failover strategies during the operation. This makes live migrations a high-risk, opaque failure mode for systems built on managed databases, and it significantly increases dependency risk on the underlying provider.

Timeline

  • 15:57 UTC: Alerts triggered for elevated latency and 5xx errors; incident declared
  • 16:10 UTC: Origin Outage Mode enabled; session validation restored
  • 16:23 UTC: Incident resolved automatically as the migration completed
  • Post-incident: Root cause confirmed by Google Cloud

Time to Detect (TTD): ~1 minute
Time to Resolve (TTR): ~26 minutes

Who was affected?

All applications relying on Clerk for authentication during the incident window experienced degraded or failed login and session requests. The impact was particularly severe for:

  • Applications with high authentication request volumes
  • Systems without fallback authentication mechanisms
  • Workloads dependent on real-time session validation

How did Clerk respond?

Clerk responded quickly by:

  • Detecting the issue via monitoring and alerting systems
  • Attempting mitigation through read-replica routing
  • Activating Origin Outage Mode to preserve session functionality
  • Escalating the issue to Google Cloud

Following the incident, Clerk initiated several long-term remediation steps:

  • Requesting database pinning to avoid future live migrations
  • Planning to replace live migrations with controlled replica promotion
  • Evaluating alternative database strategies, including self-hosted options

Additionally, Clerk shifted a significant portion of its engineering resources toward reliability-focused initiatives, prioritizing system stability over feature development.

How did Clerk communicate?

Clerk communicated transparently through a detailed public postmortem.

The communication included:

  • A clear timeline of events
  • Honest acknowledgment of responsibility
  • Technical root cause explanation
  • Concrete remediation plans

Notably, Clerk chose to explicitly name their upstream provider, emphasizing accountability and clarity over abstraction.

Key learnings for other teams

  • Opaque infrastructure risks. Managed services can introduce hidden operations (such as live migrations) that are outside your control but can still impact system availability.
  • Latency is as dangerous as downtime. Systems may appear operational while increased latency silently degrades performance and triggers cascading failures.
  • Cascading failure patterns. A chain like database latency → lock contention → compute saturation → request failure is a critical failure pattern that must be monitored and mitigated.
  • Failover limitations in managed systems. Replica failover may be unsafe or unavailable during certain provider operations (such as live migrations), removing a key recovery mechanism during incidents.
  • Resilience mechanisms must scale under stress. Fallback modes and safeguards must remain functional even when compute resources are constrained.
  • Avoid single-provider dependency risks. Critical infrastructure should include strategies for redundancy, failover, or partial independence from a single cloud provider.
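One concrete pattern behind several of these learnings is a fallback that degrades gracefully when the primary dependency slows down or fails, rather than failing closed. A minimal sketch, loosely modeled on an "Origin Outage Mode"-style cached session validation (all class and parameter names are hypothetical, not Clerk's API):

```python
import time

# Sketch of degraded-mode session validation: if the primary store fails,
# serve recently validated sessions from a local cache instead of rejecting
# every request. Hypothetical illustration; names are invented.

class SessionValidator:
    def __init__(self, primary, cache_ttl_s: float = 300.0):
        self.primary = primary       # callable: session_id -> bool (may raise)
        self.cache = {}              # session_id -> (verdict, validated_at)
        self.cache_ttl_s = cache_ttl_s

    def validate(self, session_id: str) -> bool:
        try:
            valid = self.primary(session_id)
            self.cache[session_id] = (valid, time.monotonic())
            return valid
        except Exception:
            # Degraded mode: trust a recent cached verdict rather than
            # returning an error for every request.
            entry = self.cache.get(session_id)
            if entry is not None:
                valid, at = entry
                if time.monotonic() - at < self.cache_ttl_s:
                    return valid
            raise  # no recent verdict: fail rather than guess
```

Note that the fallback path does no database work at all, which is exactly the property "resilience mechanisms must scale under stress" demands: the safety net must not depend on the resource that is already saturated.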

Quick Summary

Clerk experienced a SEV-1 outage caused by a failed Google Cloud SQL live migration. The resulting disk latency triggered lock contention and compute saturation, leading to widespread authentication failures. The incident highlights the risks of opaque cloud operations and the limitations of failover during provider-controlled events.

How ilert can help

Incidents like Clerk’s outage highlight how quickly infrastructure-level failures can cascade while remaining difficult to diagnose. While ilert cannot prevent upstream provider failures, it significantly reduces detection time, investigation friction, and response coordination complexity.

  • Advanced alerting: Correlate latency spikes, error rates, and saturation signals to surface early indicators of cascading failures, helping teams identify systemic issues faster.
  • On-call management: Automatically route SEV-1 incidents to the right engineers (backend, infrastructure, database specialists), ensuring immediate and structured response.
  • ChatOps coordination: Centralize investigation steps, mitigation attempts, and hypotheses in real time, avoiding fragmented communication during high-pressure incidents.
  • Status pages: Communicate authentication outages and degraded performance clearly to end users, reducing uncertainty and support load during incidents.
