
Bluesky: Decoding the loopback death spiral and missing concurrency limits

This article examines the April 2026 Bluesky outage, where a missing concurrency limit in an internal RPC handler triggered port exhaustion, logging overload, and recurring AppView crashes. We explore how large batch requests led to a backend “death spiral,” how engineers stabilized the service, and what teams can learn about concurrency control, observability, and failure isolation.


Company and product

Bluesky is a decentralized social media platform powered by the AT Protocol. It operates an open ecosystem where users can maintain portable identities across different service providers. A core component of this infrastructure is the AppView, which aggregates global network data to serve user feeds and profiles.

To handle massive scale, the AppView utilizes a high-performance data plane that relies on ScyllaDB and a memcached layer. This caching strategy is designed to minimize database load, ensuring that millions of concurrent requests can be served with sub-millisecond latency.

What happened

The incident was triggered by a newly deployed internal service that began sending GetPostRecord requests containing batches of 15,000 to 20,000 URIs. While the request frequency was low (under 3 per second), the sheer size of the batches overwhelmed the system's ability to manage connections.

Because this RPC handler lacked a concurrency limit, it launched one goroutine per URI, up to 20,000 simultaneous goroutines per request, each querying memcached. This led to the following chain of events:

Thousands of memcached connections were opened and closed rapidly, exhausting available TCP ports.
Sockets became stuck in the TCP TIME_WAIT state, preventing new connections.
The system began logging these failures at a rate of millions per second.
The Go runtime spawned thousands of OS threads to handle blocking logging syscalls, leading to massive garbage collection (GC) pauses and out-of-memory (OOM) crashes.
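To get a feel for why the ports ran out, the following back-of-the-envelope sketch compares supply and demand. The numbers are assumptions, not figures from the incident report: it uses the Linux default ephemeral port range (32768 to 60999) and a 60-second TIME_WAIT, and it assumes the worst case in which every URI in a batch opens its own short-lived connection.

```go
package main

import "fmt"

// Assumed Linux defaults; the source does not state the host's real settings.
const (
	ephemeralLow  = 32768 // lower bound of net.ipv4.ip_local_port_range
	ephemeralHigh = 60999 // upper bound of net.ipv4.ip_local_port_range
	timeWaitSecs  = 60    // how long a closed socket lingers in TIME_WAIT
)

// usablePorts is the number of source ports available for one
// (source IP, destination IP, destination port) combination.
func usablePorts() int { return ephemeralHigh - ephemeralLow + 1 }

// sustainableConnsPerSec is the steady rate of new short-lived connections
// that one combination can absorb before every port is parked in TIME_WAIT.
func sustainableConnsPerSec() int { return usablePorts() / timeWaitSecs }

// worstCaseDemand assumes every URI in a batch opens its own connection.
func worstCaseDemand(reqPerSec, urisPerReq int) int { return reqPerSec * urisPerReq }

func main() {
	fmt.Println("usable ports:", usablePorts())
	fmt.Println("sustainable new connections/s:", sustainableConnsPerSec())
	fmt.Println("worst-case demand/s:", worstCaseDemand(3, 20000))
}
```

Under these assumptions, even a low request rate can demand far more new connections per second than a single address pair can recycle, which is consistent with the "address already in use" errors described above.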

The service entered a cycle in which it would function for approximately 30 minutes before running out of memory and restarting, only to find its connection pools immediately saturated again.

The root cause was a single missing line of code in an internal Go library, and two other factors amplified its effect:

Missing Concurrency Guard: The GetPostRecord endpoint was the only RPC handler in the system missing an errgroup.SetLimit() call.
Logging Overhead: The system used blocking write(2) syscalls for logging. At a scale of millions of errors per second, this caused a thread-count explosion that overwhelmed the Go runtime.
Aggressive Tuning: Environment variables for GOGC and GOMEMLIMIT were tuned so tightly that they provided no buffer for the sudden spike in OS threads and memory pressure.
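The missing guard can be illustrated with a minimal sketch. The source names errgroup's SetLimit as the fix; to stay within the standard library, this example swaps in a buffered channel used as a counting semaphore, which bounds concurrency in the same way. The names fetchAll and peakConcurrency, the limit of 64, and the empty URI slice are hypothetical and not taken from Bluesky's code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fetchAll calls fetch once per URI but never runs more than limit calls
// at the same time. The buffered channel acts as a counting semaphore.
func fetchAll(uris []string, limit int, fetch func(string)) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, uri := range uris {
		sem <- struct{}{} // blocks while limit fetches are already in flight
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when this fetch ends
			fetch(u)
		}(uri)
	}
	wg.Wait()
}

// peakConcurrency runs fetchAll over n placeholder URIs and reports the
// highest number of fetches that were ever in flight together.
func peakConcurrency(n, limit int) int64 {
	uris := make([]string, n)
	var inFlight, peak atomic.Int64
	fetchAll(uris, limit, func(string) {
		cur := inFlight.Add(1)
		for {
			p := peak.Load()
			if cur <= p || peak.CompareAndSwap(p, cur) {
				break
			}
		}
		inFlight.Add(-1)
	})
	return peak.Load()
}

func main() {
	// A 20,000-URI batch never has more than 64 lookups in flight.
	fmt.Println("peak in-flight fetches:", peakConcurrency(20000, 64))
}
```

With the limit in place, a large batch takes longer to complete but can no longer open tens of thousands of connections at once. Without it, the same loop starts one goroutine per URI immediately.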

Timeline

April 3, 22:16 UTC: First "address already in use" errors recorded in backend logs.
April 4 (Saturday): Initial alert/page received. Engineers suspected a network transit issue.
April 5 (Sunday): Service continued to experience intermittent "dips" while troubleshooting continued.
April 6 (Monday): Incident escalated to SEV-1; about 50% of users experienced intermittent downtime over roughly 8 hours.
April 6, 23:00 UTC: Emergency "band-aid" fix deployed (loopback IP rotation); service stabilized.
April 8 (Wednesday): Root cause identified and permanent concurrency limit patched.

Time to Detect (TTD): 2 hours (from the first major Saturday dip)

Time to Resolve (TTR): 50 hours (to full stabilization)

Who was affected?

The outage primarily impacted users served by a specific data center where the new internal service was active.

User Impact: Approximately 50% of the total Bluesky user base experienced intermittent connectivity and feed loading failures.
Service Impact: The AppView data plane suffered recurring OOM crashes and extreme latency during GC pauses.
Developer Impact: Internal teams were unable to rely on the GetPostRecord RPC for cross-service data aggregation.

How did Bluesky respond?

Bluesky responded by implementing a highly unconventional "band-aid" to break the death spiral. Engineers modified the memcached client to use a custom dialer that picked a random loopback IP address (within the 127.0.0.0/8 range) for every connection. Because ephemeral ports are tracked per combination of addresses rather than globally, spreading connections across many loopback addresses expanded the usable port space from roughly 65,000 to millions, allowing the service to bypass the TIME_WAIT bottleneck.
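A dialer along these lines might look like the sketch below. It is an illustration under stated assumptions, not Bluesky's actual patch: it assumes the random address is used as the local (source) address of each connection, that the host is Linux (where the whole 127.0.0.0/8 block is routed to the loopback interface), and that memcached listens on 127.0.0.1:11211. The function names are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"net"
	"time"
)

// randomLoopbackIP returns a random address inside 127.0.0.0/8.
// The last octet is kept in 1..254 to avoid .0 and .255 addresses.
func randomLoopbackIP() net.IP {
	return net.IPv4(127, byte(rand.Intn(256)), byte(rand.Intn(256)), byte(1+rand.Intn(254)))
}

// dialFromRandomLoopback connects to addr using a random loopback source
// address, so each connection draws from a different pool of source ports.
func dialFromRandomLoopback(ctx context.Context, addr string) (net.Conn, error) {
	d := net.Dialer{
		Timeout:   time.Second,
		LocalAddr: &net.TCPAddr{IP: randomLoopbackIP()}, // port 0: kernel picks one
	}
	return d.DialContext(ctx, "tcp", addr)
}

func main() {
	// Assumed memcached address; adjust for the local setup.
	conn, err := dialFromRandomLoopback(context.Background(), "127.0.0.1:11211")
	if err != nil {
		fmt.Println("dial failed (is memcached listening?):", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected from", conn.LocalAddr())
}
```

This only widens the pool of address and port combinations. It does not reduce the number of connections being opened, which is why a real concurrency limit was still needed.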

Following the incident, Bluesky committed to:

Implementing mandatory errgroup limits across all RPC handlers.
Transitioning from blocking logging to Prometheus-based metrics and OTEL tracing for high-scale errors.
Improving per-client observability to identify "heavy" internal requests instantly.
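The move from per-event logging to metrics can be sketched as follows. This is a standard-library stand-in, not the Prometheus or OTEL client code the team committed to: the hot path increments an in-memory counter, and a separate flush emits one summary per error kind. The errorCounter type and the error string used in main are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// errorCounter counts errors by kind, so the hot path performs a map
// increment instead of a blocking write for every failure.
type errorCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newErrorCounter() *errorCounter {
	return &errorCounter{counts: map[string]uint64{}}
}

// Inc records one occurrence of the given error kind.
func (c *errorCounter) Inc(kind string) {
	c.mu.Lock()
	c.counts[kind]++
	c.mu.Unlock()
}

// Flush returns the counts gathered since the previous flush and resets them.
func (c *errorCounter) Flush() map[string]uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.counts
	c.counts = map[string]uint64{}
	return out
}

func main() {
	c := newErrorCounter()
	for i := 0; i < 1_000_000; i++ {
		c.Inc("dial: address already in use")
	}
	// One summary line replaces a million individual log writes.
	for kind, n := range c.Flush() {
		fmt.Printf("%s: %d occurrences\n", kind, n)
	}
}
```

In a real service the flush would typically run on a timer or be replaced by a metrics library's scrape endpoint. The point of the sketch is only that the cost per error no longer includes a syscall.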

How did Bluesky communicate?

Communication was handled via the official status page and a detailed technical blog post by Jim Calabro. While the team initially misidentified the issue as a third-party provider fault due to misleading traceroute data, they quickly corrected the record and provided a transparent breakdown of the internal coding error.

Key learnings for other teams

Bound your concurrency: Never assume batch sizes will remain small; always set a hard limit on goroutine creation for network-bound tasks.
Audit your logging: High-frequency logging in error paths can become a performance bottleneck that crashes your runtime.
Expand your metrics: Implement per-client observability to distinguish between general traffic spikes and single-source resource exhaustion.
Beware of TIME_WAIT: In high-throughput environments, the standard ephemeral port range can be a silent killer if connections are not recycled properly.

Quick Summary

Bluesky experienced a SEV-1 outage caused by a missing concurrency limit in a batch request handler. This triggered a TCP port shortage and a logging "death spiral" that paralyzed the Go runtime. The incident was resolved through an emergency loopback IP randomization hack and a subsequent code patch.

How ilert can help

During a complex "death spiral" incident, every minute of investigation counts. ilert helps teams minimize downtime by:

Advanced Alerting: Instantly paging the right backend engineers via voice or SMS when traffic dips occur, bypassing silent dashboards.
Incident Communication: Providing transparent, real-time Status Pages to keep users informed and reduce support ticket volume.
Observability Integration: Linking your metrics directly to on-call schedules so that "port exhaustion" alerts reach the right expert immediately.

