AWS: US-EAST-1 load-balancer glitch sparks internet-scale outage
On 20 Oct 2025, a faulty NLB health monitor in US-EAST-1 cascaded into DNS/service-discovery failures, degrading thousands of apps for ~15 hours.
Company and product
Amazon Web Services (AWS) is the largest public cloud provider by market share, accounting for roughly 30–31% of global cloud infrastructure spend in 2024, ahead of Microsoft Azure and Google Cloud. This scale makes AWS a de facto utility for the internet’s compute, storage, and networking needs.
AWS reports “1,000,000+ active customers every month” and regularly describes its community as “millions of active customers” across startups, enterprises, and the public sector. Household-name adopters include Netflix, BMW, Pfizer, and Canva, and tens of thousands of partners in the AWS Partner Network extend and integrate its services.
Economically, AWS is one of the world’s largest enterprise IT businesses in its own right. In Q2 2025, AWS delivered $30.9B in revenue and $10.2B in operating income, highlighting how central the platform has become to modern digital services – from ecommerce checkouts and banking APIs to media streaming and AI workloads.
Technically, AWS’s core building blocks – Amazon EC2, S3, EBS, RDS/Aurora, DynamoDB, Lambda, SQS/SNS, API Gateway, Route 53, and Elastic Load Balancing – are augmented by edge networking and a private backbone spanning 6+ million km of fiber. This combination enables low-latency global apps, resilient multi-AZ designs, and rapid scale-up during traffic surges.
This concentration matters for the incident at hand: because so many digital services run on AWS, often anchored in a small set of Regions such as US-EAST-1, disruptions ripple across finance, gaming, retail, media, healthcare, and government simultaneously. The platform’s scale, customer base, and ecosystem mean an outage isn’t just a vendor issue – it’s a systemic internet event.
What happened?
On October 20, 2025, AWS experienced a major, multi-service outage centered on US-EAST-1 (N. Virginia). The incident caused elevated error rates and latency across load balancers and dependent services, which cascaded into DNS resolution failures and service errors for customers. AWS later traced the incident to a malfunction in the health monitoring subsystem for Network Load Balancers (NLB) within the EC2 internal network; several outlets also reported DNS resolution side-effects.
Timeline (Europe/Berlin time)
- Start (first widespread impact): 09:11 CEST, Oct 20 – outage begins in US-EAST-1, manifesting as DNS/ELB errors and API failures.
- Initial AWS acknowledgment: early morning CEST via the AWS Health Dashboard, noting increased error rates/latency in US-EAST-1. (The public Health Dashboard is AWS’s authoritative channel for service incidents.) TTD (time to detect) ≈ 0–10 min.
- Partial mitigation: 12:35 CEST, Oct 20 – AWS reports the underlying problem mitigated; recovery progresses for impacted services. MTTM (time to mitigation) ≈ 3h24m.
- Residual impact: Delayed message processing/backlogs persisted in parts of the fleet during the afternoon/evening.
- Full recovery declared: 00:01 CEST, Oct 21 – AWS states all systems are operational, with services steadily normalizing. TTR (time to recovery, to full green) ≈ 14h50m.
Who was affected?
Thousands of organizations and popular apps experienced disruptions, including Snapchat, Reddit, Fortnite/Epic Games, Zoom, Venmo, Coinbase, and Robinhood, as well as Amazon’s own Alexa, Ring, and portions of its retail operations. UK public websites (e.g., HMRC) and banks (Lloyds/Halifax) also reported problems. Impact spanned consumer apps, finance, education, and e-commerce.
How AWS responded
AWS contained the incident by isolating and correcting the faulty load-balancer health-monitoring component in US-EAST-1, while applying DNS and workload-routing mitigations to stabilize service discovery and API calls. For recovery, teams prioritized restoring core control-plane paths (instance launches, Lambda invocations, SQS polling) and drained backlogs to eliminate delayed processing. Post-incident, AWS confirmed services were back to normal, noting some residual message-processing lag during tail recovery.
How AWS communicated
During the outage, AWS used the AWS Health Dashboard (Service Health) as the primary communication channel, posting region-specific updates that emphasized the US-EAST-1 scope and the multi-service impact. In parallel, major media outlets, most notably Reuters and The Guardian, amplified key details on cause, blast radius, and recovery progress, improving situational awareness for customers and stakeholders who didn’t have console access.
Key learnings for other companies
- Design for regional blast radius. Treat US-EAST-1 as a single failure domain. Use multi-region active/active for customer-facing endpoints, and fail over DNS and stateful stores with data-tier replication and conflict-aware design (see the Route 53 failover sketch after this list).
- Decouple from provider DNS/ELB assumptions. Add conditional DNS failover, independent health checks, and client-side timeouts/retries to avoid tight coupling to one region’s control plane (a timeout/retry sketch follows the list).
- Backpressure and graceful degradation. Implement circuit breakers, queue backpressure, and read-only/limited modes so core functions stay available while non-critical features shed load; residual message delays were a major pain point in this incident (see the circuit-breaker sketch below).
- Status ingestion and automation. Programmatically ingest AWS Health events via EventBridge; auto-flip feature flags, scale failover capacity, and pre-warm alternative regions when Health signals ELB/DNS incidents (see the EventBridge sketch below).
- Runbooks that assume partial control-plane loss. Practice drills for EC2/Lambda launch failures, SQS lag, and DNS resolution errors; pre-compute failover DNS records and traffic policies.
- Comms readiness. Maintain a standalone, multi-cloud status page and customer comms macros for cloud-provider incidents; push succinct updates every 20–30 minutes during major events.
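For the DNS failover point above, here is a minimal sketch using boto3. It upserts a PRIMARY/SECONDARY failover pair in Route 53 so resolution shifts to a second region when the primary health check fails. The hosted zone ID, health check ID, and hostnames are placeholders, not values from this incident.

```python
import boto3  # AWS SDK for Python

route53 = boto3.client("route53")

# Placeholders -- substitute your own hosted zone and health check IDs.
HOSTED_ZONE_ID = "Z0000000000000000000"
PRIMARY_HEALTH_CHECK_ID = "11111111-1111-1111-1111-111111111111"


def upsert_failover_records():
    """Create/refresh a failover pair for api.example.com: Route 53 answers
    with the PRIMARY record while its health check passes and automatically
    falls back to the SECONDARY record when it fails."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "Failover pair spanning two regions",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "api.example.com",
                        "Type": "CNAME",
                        "TTL": 60,
                        "SetIdentifier": "primary-us-east-1",
                        "Failover": "PRIMARY",
                        "HealthCheckId": PRIMARY_HEALTH_CHECK_ID,
                        "ResourceRecords": [{"Value": "api-us-east-1.example.com"}],
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "api.example.com",
                        "Type": "CNAME",
                        "TTL": 60,
                        "SetIdentifier": "secondary-eu-central-1",
                        "Failover": "SECONDARY",
                        "ResourceRecords": [{"Value": "api-eu-central-1.example.com"}],
                    },
                },
            ],
        },
    )


if __name__ == "__main__":
    upsert_failover_records()
```

Keeping the TTL low (60 seconds here) shortens the window during which resolvers keep handing out the failed primary endpoint.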
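For client-side timeouts and retries, the following sketch pins tight botocore timeouts and bounded, adaptive retries on an SDK client so calls to a degraded regional endpoint fail fast instead of hanging request threads. The specific values are illustrative, not recommendations.

```python
import boto3
from botocore.config import Config

# Short timeouts and bounded, client-side rate-limited retries keep a
# degraded regional endpoint from tying up request threads.
resilient_config = Config(
    connect_timeout=2,   # seconds to establish a connection
    read_timeout=5,      # seconds to wait for a response
    retries={"max_attempts": 3, "mode": "adaptive"},
)

# Example: a DynamoDB client pinned to a specific region with the tight config.
dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=resilient_config)
```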
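For backpressure and graceful degradation, here is a minimal circuit-breaker sketch in plain Python: after a run of failures the breaker opens and the feature serves a cheap fallback until a cooldown elapses. The downstream call is a hypothetical placeholder, not part of any real API.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    allow a probe call after a cooldown, close again on success."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: calls flow normally
        # open: allow a probe once the cooldown has elapsed
        return (time.monotonic() - self.opened_at) >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()


def call_recommendation_service(user_id):
    # Placeholder for a real downstream call (HTTP/gRPC/etc.).
    raise TimeoutError("downstream unavailable")


breaker = CircuitBreaker()


def fetch_recommendations(user_id):
    """Non-critical feature: serve an empty fallback while the breaker is open
    so the core request path keeps working during a regional incident."""
    if not breaker.allow():
        return []  # degraded mode: shed load instead of erroring
    try:
        result = call_recommendation_service(user_id)
        breaker.record_success()
        return result
    except TimeoutError:
        breaker.record_failure()
        return []
```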
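For status ingestion and automation, the sketch below creates an EventBridge rule that matches AWS Health issue events for ELB, EC2, and Route 53 and points it at a Lambda that reacts when US-EAST-1 is affected. The Lambda ARN and the feature-flag reaction are placeholders; verify the exact service names and event type codes against the AWS Health documentation and your own Health events.

```python
import json
import boto3

events = boto3.client("events")

# AWS Health events arrive on EventBridge with source "aws.health" and
# detail-type "AWS Health Event"; service names and categories below should
# be checked against your own Health events.
HEALTH_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["ELASTICLOADBALANCING", "EC2", "ROUTE53"],
        "eventTypeCategory": ["issue"],
    },
}


def create_health_rule():
    """Create an EventBridge rule and target it at a reaction Lambda."""
    events.put_rule(
        Name="aws-health-regional-issues",
        EventPattern=json.dumps(HEALTH_PATTERN),
        State="ENABLED",
    )
    events.put_targets(
        Rule="aws-health-regional-issues",
        Targets=[{
            "Id": "health-reactor",
            # Hypothetical Lambda ARN -- replace with your own function.
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:health-reactor",
        }],
    )


def handler(event, context):
    """Lambda target: degrade gracefully when US-EAST-1 reports an issue."""
    detail = event.get("detail", {})
    if event.get("region") == "us-east-1":
        # Hypothetical reaction -- wire this to your feature-flag store,
        # capacity automation, or paging workflow.
        print(f"Disabling non-critical features: {detail.get('eventTypeCode')}")
```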
Quick summary
A major US-EAST-1 incident at AWS triggered a global disruption on October 20–21, degrading or taking offline thousands of apps and sites across social, finance, gaming, government, retail, and more. Beginning around 09:11 CEST and not fully green until 00:01 CEST on October 21 (with partial relief at 12:35 CEST), the outage stemmed from a fault in EC2’s load-balancer health monitoring that cascaded into DNS/service-discovery instability. AWS applied mitigations, restored core control-plane paths, and drained backlogs to recover services.
The significance is hard to overstate: because so many critical digital services concentrate on AWS – often anchored in US-EAST-1 – a single-region failure became an internet-scale event, exposing systemic concentration risk and the operational necessity of multi-region, provider-agnostic resilience.
How ilert can help
To keep critical alerts moving even when a vendor stumbles, ilert is multi-sourced by design: we partner with multiple telecom providers (Twilio, Vonage, and MessageBird), so SMS and voice calls continue through alternate routes if any single provider experiences an outage. This redundancy helps ensure on-call teams receive pages when they matter most. Here are a few more ways ilert can help during major incidents like this one:
- Automated cloud signals: Ingest AWS Health events into ilert to enrich alerts with context and automatically open incidents (a minimal forwarding sketch follows this list).
- Multi-region readiness: Use ilert’s routing rules and on-call schedules to page the right teams (SRE, networking, app owners) the moment ELB or DNS degradation is detected.
- Communication: Publish status updates and send customer notifications directly from ilert – automatically and with the help of AI to expedite the process – keeping a steady cadence while engineers work through backlogs.
- Postmortems and analytics: Produce structured reviews, track MTTR, and surface recurring weak spots (e.g., service discovery, control-plane dependencies) to drive concrete improvements.
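As one concrete example of the automated cloud signals point above, the sketch below forwards an AWS Health event (for instance, from the EventBridge rule shown earlier) to ilert as an alert. The endpoint, integration key, and payload fields are placeholders under the assumption of a generic events/webhook integration; use the URL and event schema from your ilert alert source and the ilert API documentation.

```python
import json
import urllib.request

# Placeholder endpoint and key -- replace with the URL and integration key
# from your ilert alert source (see the ilert API docs for the exact schema).
ILERT_EVENTS_URL = "https://api.ilert.com/api/events"
INTEGRATION_KEY = "il1api0example0key"


def forward_health_event(event, context):
    """Lambda target for the EventBridge rule above: turn an AWS Health
    event into an ilert alert so the on-call is paged with context."""
    detail = event.get("detail", {})
    payload = {
        "apiKey": INTEGRATION_KEY,
        "eventType": "ALERT",
        "summary": f"AWS Health: {detail.get('eventTypeCode', 'unknown')} "
                   f"in {event.get('region', 'unknown region')}",
        "details": json.dumps(detail),
    }
    req = urllib.request.Request(
        ILERT_EVENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```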
