Railway: How a GCP account suspension took Railway down for 8 hours
On May 19, 2026, Railway suffered a SEV-1 outage after an automated GCP account suspension exposed a hidden control-plane dependency. Here’s what happened, how the outage cascaded across its multi-cloud platform, and the lessons teams can apply to strengthen routing, recovery, and provider resilience.
Company and product
Railway is an infrastructure platform that helps developers deploy, host, and manage applications, databases, services, and APIs without manually configuring complex cloud infrastructure.
To ensure high availability and resilience, the platform runs workloads across multiple environments, including Google Cloud Platform (GCP), Amazon Web Services (AWS), and its own bare-metal infrastructure, Railway Metal.
Despite this distributed architecture, Railway’s network control plane maintained a hard dependency on Google Cloud. The edge routing layer still relied on a GCP-hosted control plane API for workload discoverability and routing table updates.
What happened
At around 22:10 UTC on May 19, an automated action by Google Cloud incorrectly suspended Railway’s production account without proactive outreach or warning. This immediately disabled Railway’s GCP-dependent infrastructure, including the dashboard, API, databases, compute infrastructure, and parts of the network layer.
Users began seeing 503 errors on the dashboard and API, including “no healthy upstream” and “unconditional drop overload” messages. Many users were also unable to log in.
The incident is notable because it exposed a hidden weakness in Railway’s multi-cloud architecture. Initially, customer workloads running on AWS and Railway Metal remained online. However, Railway’s edge proxies depended on routing information from a GCP-hosted network control plane API. Once the cached route data expired, the edge layer could no longer resolve routes to active instances.
As a result, workloads outside GCP began returning 404 errors, making all Railway workloads across all regions unreachable at peak impact.
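To make the failure mode concrete, here is a minimal sketch of a route cache that keeps serving the last known route when its control plane cannot be reached, instead of dropping the route once the TTL passes. This is not Railway's proxy code: `RouteCache`, `fetch`, and `ttl_s` are hypothetical names, and the control plane lookup is assumed to raise `ConnectionError` when it is unavailable.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class RouteCache:
    """Caches route lookups and, when the control plane is unreachable,
    keeps serving the last known route instead of failing."""

    def __init__(self, fetch: Callable[[str], str], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # hypothetical control-plane lookup
        self._ttl_s = ttl_s      # how long an entry counts as fresh
        self._clock = clock      # injectable so the TTL can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, host: str) -> Optional[str]:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now - cached[1] < self._ttl_s:
            return cached[0]  # fresh hit, no control-plane call needed
        try:
            upstream = self._fetch(host)
        except ConnectionError:
            # Control plane unreachable: fall back to the stale entry,
            # or None (a 404 at the edge) if the host was never seen.
            return cached[0] if cached is not None else None
        self._entries[host] = (upstream, now)
        return upstream
```

The trade-off is that a stale route may point at an instance that no longer exists, so a design like this usually pairs the fallback with health checks and an upper bound on how old a served entry may be.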
Recovery introduced additional complexity. Restoring GCP account access did not immediately bring the platform back online. Railway had to recover persistent disks, compute instances, networking, edge routing, and orchestration layer by layer. During recovery, a burst of retried requests also triggered GitHub rate limits on Railway’s OAuth and webhook integrations, temporarily creating additional blockers for logins and builds.
Timeline
- May 19, 22:10 UTC: Automated monitoring detected API health check failures and paged on-call engineers.
- May 19, 22:11 UTC: The Railway dashboard began returning 503 errors; users were unable to log in.
- May 19, 22:19 UTC: Railway identified the root cause: Google Cloud had suspended the production account.
- May 19, 22:22 UTC: Railway filed a P0 ticket with Google Cloud and directly engaged their GCP account manager.
- May 19, 22:29 UTC: Railway officially declared an incident. GCP account access was restored, but compute instances and persistent disks remained down.
- May 19, 22:35 UTC: Cached network routes began expiring. Workloads on Railway Metal and AWS started returning 404 errors due to route resolution failures.
- May 19, 23:09 UTC: The first persistent disk came back online.
- May 19, 23:54 UTC: All persistent disks reached a ready state, though the network remained down.
- May 20, 00:39 UTC: Disks were confirmed ready; recovery was blocked pending Google Cloud networking restoration.
- May 20, 01:30 UTC: Compute instances began recovering.
- May 20, 01:38 UTC: Networking was restored, and edge traffic resumed.
- May 20, 01:57 UTC: Orchestration and build infrastructure came back online. Deployments were temporarily paused to prevent the platform from being overwhelmed.
- May 20, 02:04 UTC: Compute hosts were incrementally brought back online.
- May 20, 02:47 UTC: GitHub began rate-limiting Railway’s OAuth and webhook integrations, temporarily blocking some logins and builds.
- May 20, 02:55 UTC: The dashboard became accessible.
- May 20, 03:59 UTC: Deployments began processing across all tiers.
- May 20, 04:00 UTC: API, dashboard, and OAuth endpoints were confirmed operational. Remaining workloads continued to restore.
- May 20, 06:14 UTC: The incident moved to the monitoring phase.
- May 20, 07:58 UTC: The incident was fully resolved.
Time to Detect (TTD): Within minutes. Monitoring detected API health check failures at 22:10 UTC, and engineers identified the root cause by 22:19 UTC.
Time to Resolve (TTR): Approximately 8 hours from detection to the monitoring phase (22:10 to 06:14 UTC), and 9 hours 48 minutes until full resolution at 07:58 UTC.
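These intervals can be checked directly against the timeline. The snippet below is a small sketch that measures each one from the 22:10 UTC detection entry. The helper names `utc` and `fmt` are illustrative, and choosing a different starting point (for example, the incident declaration at 22:29 UTC) would shorten each total accordingly.

```python
from datetime import datetime, timedelta, timezone


def utc(day: int, hour: int, minute: int) -> datetime:
    """Build a UTC timestamp in May 2026 from a timeline entry."""
    return datetime(2026, 5, day, hour, minute, tzinfo=timezone.utc)


def fmt(delta: timedelta) -> str:
    """Format a duration as hours and zero-padded minutes."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


detected = utc(19, 22, 10)    # health check failures detected
root_cause = utc(19, 22, 19)  # suspension identified
monitoring = utc(20, 6, 14)   # incident moved to monitoring
resolved = utc(20, 7, 58)     # incident fully resolved

print("to root cause:", fmt(root_cause - detected))  # 0h 09m
print("to monitoring:", fmt(monitoring - detected))  # 8h 04m
print("to resolution:", fmt(resolved - detected))    # 9h 48m
```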
Who was affected?
The outage impacted Railway customers across all regions. Users lost access to the dashboard, API, deployment management, and build triggers. Initially confined to GCP-hosted infrastructure, the impact expanded to AWS and Railway Metal workloads once edge routing caches expired. At the incident's peak, all workloads across all regions were unreachable.
How did Railway respond?
Railway’s response was fast and relied heavily on automated monitoring. Once engineers were paged for API health check failures, they identified the GCP suspension within nine minutes and immediately escalated the issue via a P0 ticket and their GCP account manager.
Recognizing that restoring account access wouldn't instantly fix the platform, the team executed a controlled, layer-by-layer recovery. They systematically validated persistent disks, compute instances, networking, and edge routing. To prevent a secondary outage from queued operations, Railway strategically paused deployments and drained the backlog gradually. They also managed the secondary GitHub rate-limiting issue, which temporarily affected OAuth logins and webhook-based builds. Crucially, Railway took full ownership of the architectural flaw that allowed a single vendor's action to cause a global outage.
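Draining a backlog gradually can be as simple as releasing queued work in small batches with a pause in between. The following is a minimal sketch under that assumption, not a description of Railway's tooling: `drain_backlog`, `batch_size`, and `pause_s` are hypothetical names, and `run` stands in for whatever actually executes a queued deployment.

```python
import time
from collections import deque
from typing import Callable, Deque, Iterable, List


def drain_backlog(jobs: Iterable[str], run: Callable[[str], None],
                  batch_size: int, pause_s: float,
                  sleep: Callable[[float], None] = time.sleep) -> List[int]:
    """Run queued jobs in fixed-size batches with a pause between batches.
    Returns the size of each batch that was released."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    queue: Deque[str] = deque(jobs)
    released: List[int] = []
    while queue:
        # Take at most batch_size jobs from the front of the queue.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        for job in batch:
            run(job)
        released.append(len(batch))
        if queue:  # no need to wait after the final batch
            sleep(pause_s)
    return released
```

In practice the batch size and pause would be tuned against the health of the recovering system, and raised step by step as capacity returns.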
How did Railway communicate?
Railway published a comprehensive, transparent incident report. They detailed the timeline, impact, root cause, and their step-by-step recovery process. Instead of deflecting all blame to Google Cloud, Railway explicitly acknowledged the hidden dependency in their multi-cloud network. By owning their vendor choices and stating that customer uptime is ultimately their responsibility, Railway delivered a highly credible and actionable postmortem.
Key learnings for other teams
- Multi-cloud does not automatically mean resilient: Distributing workloads across GCP, AWS, and bare metal is ineffective if a critical control-plane dependency is tied to a single provider. Verify that your failover architecture survives the loss of an entire provider, not just a zone.
- Remove single-provider dependencies from the hot path: Ensure that routing, service discovery, authentication, and deployment orchestration do not rely on one cloud provider to keep customer workloads reachable.
- Cache expiration turns partial outages into global ones: Cached route data kept AWS and Railway Metal workloads reachable only until the routes began expiring at 22:35 UTC, roughly 25 minutes after the first alerts. Understand your cache Time-To-Live (TTL) settings and exactly how the system behaves when those caches expire.
- Recovery must be staged and controlled: Flipping the power back on doesn't fix a complex system. Develop and test recovery runbooks for restoring infrastructure layer by layer (disks, network, compute, customer-facing services).
- Anticipate secondary failures during recovery: Retry storms and queued backlogs will overwhelm recovering systems. Account for webhook bursts and third-party API rate limits (like GitHub) in your disaster recovery plans.
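One common way to soften retry storms against a rate-limited third-party API is exponential backoff with jitter, which spreads retries out over time instead of letting every client retry at once. The sketch below illustrates the "full jitter" variant. All names (`backoff_delay`, `call_with_retries`, `is_retryable`) and the default values are assumptions for illustration, not part of any Railway or GitHub API.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_s: float, cap_s: float,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    source = rng if rng is not None else random
    return source.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def call_with_retries(call: Callable[[], T],
                      is_retryable: Callable[[Exception], bool],
                      max_attempts: int = 5, base_s: float = 0.5,
                      cap_s: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> T:
    """Call `call`, retrying retryable errors with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Give up on the last attempt or on a non-retryable error.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            sleep(backoff_delay(attempt, base_s, cap_s, rng))
    raise RuntimeError("max_attempts must be at least 1")
```

When the provider returns an explicit wait hint with a rate-limit response, honoring that hint is generally preferable to a computed delay.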
Quick summary
On May 19, 2026, Railway suffered a SEV-1 platform-wide outage after Google Cloud erroneously suspended its production account. Although Railway uses AWS and bare metal alongside GCP, the outage cascaded globally because edge proxies relied on a GCP-hosted control plane API to route traffic. Once cached routes expired, all workloads became unreachable. Following a staged, 8-hour recovery process, Railway committed to removing single-vendor dependencies from its data plane and redesigning its control plane for true multi-cloud resilience.
How ilert can help
Incidents like Railway’s demonstrate how a provider-side failure can rapidly escalate into a platform-wide outage due to hidden dependencies. ilert equips engineering and infrastructure teams to respond faster, coordinate seamlessly, and maintain clear communication during high-stakes events.
- Intelligent alerting: Route API health checks, dashboard failures, edge routing errors, and cloud-provider or monitoring alerts into a single incident workflow to help teams detect cascading outages earlier.
- Context-aware alert grouping: Group related alerts from the control plane, edge proxies, deployment systems, and third-party integrations into one incident, reducing noise and giving responders a clearer view of the failure chain.
- Automated escalation: Use escalation policies and on-call schedules to automatically notify the right infrastructure, platform, or network engineers when critical routing or control-plane alerts are triggered.
- Status pages: Keep customers informed with clear incident updates during prolonged provider-related outages, while engineering teams stay focused on restoration.
- Post-incident review: Reconstruct the chain of events using incident timelines, alert history, responder notes, and stakeholder updates to identify hidden dependencies before the next outage.

