Railway: How a GCP account suspension took Railway down for 8 hours

On May 19, 2026, Railway suffered a SEV-1 outage after an automated GCP account suspension exposed a hidden control-plane dependency. Here’s what happened, how the outage cascaded across its multi-cloud platform, and the lessons teams can apply to strengthen routing, recovery, and provider resilience.

Company and product

Railway is an infrastructure platform that helps developers deploy, host, and manage applications, databases, services, and APIs without manually configuring complex cloud infrastructure.

To ensure high availability and resilience, the platform runs workloads across multiple environments: Google Cloud Platform (GCP), Amazon Web Services (AWS), and Railway Metal, its own bare-metal infrastructure.

Despite this distributed architecture, Railway’s network control plane maintained a hard dependency on Google Cloud. The edge routing layer still relied on a GCP-hosted control plane API for workload discoverability and routing table updates.

What happened

Late on May 19, shortly before Railway’s monitoring alerted at 22:10 UTC, an automated action by Google Cloud incorrectly suspended Railway’s production account without proactive outreach or warning. This immediately disabled Railway’s GCP-dependent infrastructure, including the dashboard, API, databases, compute infrastructure, and parts of the network layer.

Users began seeing 503 errors on the dashboard and API, including “no healthy upstream” and “unconditional drop overload” messages. Many users were also unable to log in.

The incident is notable because it exposed a hidden weakness in Railway’s multi-cloud architecture. Initially, customer workloads running on AWS and Railway Metal remained online. However, Railway’s edge proxies depended on routing information from a GCP-hosted network control plane API. Once the cached route data expired, the edge layer could no longer resolve routes to active instances.

As a result, workloads outside GCP began returning 404 errors. At peak impact, every Railway workload in every region was unreachable.
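
The failure mode described above (routes vanish once the cache expires and the control plane cannot be reached) can be illustrated with a minimal sketch of a "serve stale on error" route cache. This is not Railway's implementation: `RouteCache`, `ControlPlaneUnavailable`, and the `fetch` callable are hypothetical names, and the sketch assumes that a stale route is preferable to a 404 when the control plane is down.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ControlPlaneUnavailable(Exception):
    """Raised by the (hypothetical) fetcher when the control plane is unreachable."""


class RouteCache:
    """Route cache that keeps serving the last known route if a refresh fails."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # hypothetical call to the control plane API
        self._ttl = ttl_seconds  # how long an entry counts as fresh
        self._clock = clock      # injectable so the behavior can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}  # host -> (upstream, fetched_at)

    def resolve(self, host: str) -> Optional[str]:
        entry = self._entries.get(host)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh cache hit
        try:
            upstream = self._fetch(host)
        except ControlPlaneUnavailable:
            # Control plane is down: fall back to the stale route if one exists.
            return entry[0] if entry is not None else None
        self._entries[host] = (upstream, now)
        return upstream
```

The trade-off is that a stale route may point at an instance that has since moved, so a real design would likely also bound how long stale entries may be served.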

Recovery introduced additional complexity. Restoring GCP account access did not immediately bring the platform back online. Railway had to recover persistent disks, compute instances, networking, edge routing, and orchestration layer by layer. During recovery, a burst of retried requests also triggered GitHub rate limits on Railway’s OAuth and webhook integrations, temporarily creating additional blockers for logins and builds.

Timeline

May 19, 22:10 UTC: Automated monitoring detected API health check failures and paged on-call engineers.
May 19, 22:11 UTC: The Railway dashboard began returning 503 errors; users were unable to log in.
May 19, 22:19 UTC: Railway identified the root cause: Google Cloud had suspended the production account.
May 19, 22:22 UTC: Railway filed a P0 ticket with Google Cloud and directly engaged their GCP account manager.
May 19, 22:29 UTC: Railway officially declared an incident. GCP account access was restored, but compute instances and persistent disks remained down.
May 19, 22:35 UTC: Cached network routes began expiring. Workloads on Railway Metal and AWS started returning 404 errors due to route resolution failures.
May 19, 23:09 UTC: The first persistent disk came back online.
May 19, 23:54 UTC: All persistent disks reached a ready state, though the network remained down.
May 20, 00:39 UTC: Disks were confirmed ready; recovery was blocked pending Google Cloud networking restoration.
May 20, 01:30 UTC: Compute instances began recovering.
May 20, 01:38 UTC: Networking was restored, and edge traffic resumed.
May 20, 01:57 UTC: Orchestration and build infrastructure came back online. Deployments were temporarily paused to prevent the platform from being overwhelmed.
May 20, 02:04 UTC: Compute hosts were incrementally brought back online.
May 20, 02:47 UTC: GitHub began rate-limiting Railway’s OAuth and webhook integrations, temporarily blocking some logins and builds.
May 20, 02:55 UTC: The dashboard became accessible.
May 20, 03:59 UTC: Deployments began processing across all tiers.
May 20, 04:00 UTC: API, dashboard, and OAuth endpoints were confirmed operational. Remaining workloads continued to restore.
May 20, 06:14 UTC: The incident moved to the monitoring phase.
May 20, 07:58 UTC: The incident was fully resolved.

Time to Detect (TTD): Within minutes. Monitoring detected API health check failures at 22:10 UTC, and engineers identified the root cause by 22:19 UTC.

Time to Resolve (TTR): Approximately 8 hours from the first alert (22:10 UTC) to the monitoring phase (06:14 UTC), and 9 hours 48 minutes from the first alert to full resolution (07:58 UTC).
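
For readers who want to check these figures, the durations can be recomputed from the timeline entries. The sketch below measures everything from the first alert at 22:10 UTC; starting the clock from a different event would shift the results. The helper names `at` and `fmt` are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def at(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp in May 2026 from a day number and 'HH:MM'."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    return datetime(2026, 5, day, hour, minute, tzinfo=timezone.utc)


def fmt(delta: timedelta) -> str:
    """Format a duration as hours and minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    return f"{total_minutes // 60}h {total_minutes % 60:02d}m"


# Timestamps taken from the timeline above.
first_alert = at(19, "22:10")
root_cause_found = at(19, "22:19")
monitoring_phase = at(20, "06:14")
fully_resolved = at(20, "07:58")

time_to_root_cause = fmt(root_cause_found - first_alert)
time_to_monitoring = fmt(monitoring_phase - first_alert)
time_to_resolution = fmt(fully_resolved - first_alert)

print(time_to_root_cause, time_to_monitoring, time_to_resolution)
# prints: 0h 09m 8h 04m 9h 48m
```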

Who was affected?

The outage impacted Railway customers across all regions. Users lost access to the dashboard, API, deployment management, and build triggers. Initially confined to GCP-hosted infrastructure, the impact expanded to AWS and Railway Metal workloads once edge routing caches expired. At the incident's peak, all workloads across all regions were unreachable.

How did Railway respond?

Railway’s response was fast and relied heavily on automated monitoring. Once engineers were paged for API health check failures, they identified the GCP suspension within nine minutes and escalated three minutes later via a P0 ticket and their GCP account manager.

Recognizing that restoring account access wouldn't instantly fix the platform, the team executed a controlled, layer-by-layer recovery. They systematically validated persistent disks, compute instances, networking, and edge routing. To prevent a secondary outage from queued operations, Railway strategically paused deployments and drained the backlog gradually. They also managed the secondary GitHub rate-limiting issue, which temporarily affected OAuth logins and webhook-based builds. Crucially, Railway took full ownership of the architectural flaw that allowed a single vendor's action to cause a global outage.
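
Pausing deployments and then draining the backlog gradually can be pictured as releasing queued jobs in small batches with a pause between them. The following is a minimal sketch of that idea, not Railway's actual tooling: `drain_gradually`, the `start` callback, and the batch and pause parameters are all hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque, Iterable


def drain_gradually(backlog: Iterable[str], start: Callable[[str], None],
                    batch_size: int, pause_seconds: float,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Start queued jobs in small batches, pausing between batches."""
    queue: Deque[str] = deque(backlog)
    started = 0
    while queue:
        # Release at most `batch_size` jobs in this round.
        for _ in range(min(batch_size, len(queue))):
            start(queue.popleft())
            started += 1
        if queue:
            # Give the recovering platform time to absorb the batch.
            sleep(pause_seconds)
    return started
```

In practice the batch size and pause would be tuned against the health of the recovering systems rather than fixed up front.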

How did Railway communicate?

Railway published a comprehensive, transparent incident report. They detailed the timeline, impact, root cause, and their step-by-step recovery process. Instead of deflecting all blame to Google Cloud, Railway explicitly acknowledged the hidden dependency in their multi-cloud network. By owning their vendor choices and stating that customer uptime is ultimately their responsibility, Railway delivered a highly credible and actionable postmortem.

Key learnings for other teams

Multi-cloud does not automatically mean resilient: Distributing workloads across GCP, AWS, and bare metal is ineffective if a critical control-plane dependency is tied to a single provider. Verify that your failover architecture survives the loss of an entire provider, not just a zone.
Remove single-provider dependencies from the hot path: Ensure that routing, service discovery, authentication, and deployment orchestration do not rely on one cloud provider to keep customer workloads reachable.
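Cache expiration turns partial outages into global ones: Cached route data gave Railway only a temporary buffer. About 25 minutes after the first alert, cached routes began expiring and healthy workloads on AWS and Railway Metal became unreachable. Understand your cache Time-To-Live (TTL) settings and exactly how the system behaves when those caches expire.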
Recovery must be staged and controlled: Flipping the power back on doesn't fix a complex system. Develop and test recovery runbooks for restoring infrastructure layer by layer (disks, network, compute, customer-facing services).
Anticipate secondary failures during recovery: Retry storms and queued backlogs will overwhelm recovering systems. Account for webhook bursts and third-party API rate limits (like GitHub) in your disaster recovery plans.
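
One common way to soften the retry storms mentioned in the last point is exponential backoff with jitter, combined with honoring any wait hint the third party returns. The sketch below is a generic illustration under those assumptions, not a description of Railway's or GitHub's behavior; `backoff_delay` and its parameters are hypothetical, and `retry_after` stands in for a server-provided hint such as an HTTP Retry-After value.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  retry_after: Optional[float] = None,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # The server told us how long to wait, so respect it.
        return max(0.0, retry_after)
    source = rng if rng is not None else random
    # Exponential growth, capped so waits never exceed `cap` seconds.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick a random delay in [0, ceiling] so clients spread out.
    return source.uniform(0.0, ceiling)
```

Randomizing the delay keeps many recovering clients from retrying at the same instant, which is what tends to trip rate limits in the first place.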

Quick summary

On May 19, 2026, Railway suffered a SEV-1 platform-wide outage after Google Cloud erroneously suspended its production account. While Railway utilizes AWS and bare metal alongside GCP, the outage cascaded globally because edge proxies relied on a GCP-hosted control plane API to route traffic. Once cached routes expired, all workloads became unreachable. Following a staged, 8-hour recovery process, Railway committed to removing single-vendor dependencies from their data plane and redesigning their control plane for true multi-cloud resilience.

How ilert can help

Incidents like Railway’s demonstrate how a provider-side failure can rapidly escalate into a platform-wide outage due to hidden dependencies. ilert equips engineering and infrastructure teams to respond faster, coordinate seamlessly, and maintain clear communication during high-stakes events.

Intelligent alerting: Route API health checks, dashboard failures, edge routing errors, and cloud-provider or monitoring alerts into a single incident workflow to help teams detect cascading outages earlier.
Context-aware alert grouping: Group related alerts from the control plane, edge proxies, deployment systems, and third-party integrations into one incident, reducing noise and giving responders a clearer view of the failure chain.
Automated escalation: Use escalation policies and on-call schedules to automatically notify the right infrastructure, platform, or network engineers when critical routing or control-plane alerts are triggered.
Status pages: Keep customers informed with clear incident updates during prolonged provider-related outages, while engineering teams stay focused on restoration.
Post-incident review: Reconstruct the chain of events using incident timelines, alert history, responder notes, and stakeholder updates to identify hidden dependencies before the next outage.
