Bluesky: Decoding the loopback death spiral and missing concurrency limits
This article examines the April 2026 Bluesky outage, where a missing concurrency limit in an internal RPC handler triggered port exhaustion, logging overload, and recurring AppView crashes. We explore how large batch requests led to a backend “death spiral,” how engineers stabilized the service, and what teams can learn about concurrency control, observability, and failure isolation.
Company and product
Bluesky is a decentralized social media platform powered by the AT Protocol. It operates an open ecosystem where users can maintain portable identities across different service providers. A core component of this infrastructure is the AppView, which aggregates global network data to serve user feeds and profiles.
To handle massive scale, the AppView utilizes a high-performance data plane that relies on ScyllaDB and a memcached layer. This caching strategy is designed to minimize database load, ensuring that millions of concurrent requests can be served with sub-millisecond latency.
What happened
The incident was triggered by a newly deployed internal service that began sending GetPostRecord requests containing batches of 15,000 to 20,000 URIs. While the request frequency was low (under 3 per second), the sheer size of the batches overwhelmed the system's ability to manage connections.
Because this specific RPC handler lacked a concurrency limit, the system launched one goroutine per URI, up to 20,000 simultaneous goroutines per request, each querying memcached. This led to the following chain of events:
- Thousands of memcached connections were opened and closed rapidly, exhausting available TCP ports.
- Sockets became stuck in the TCP TIME_WAIT state, preventing new connections.
- The system began logging these failures at a rate of millions per second.
- The Go runtime spawned thousands of OS threads to handle blocking logging syscalls, leading to massive garbage collection (GC) pauses and out-of-memory (OOM) crashes.
The service entered a cycle where it would function for approximately 30 minutes before OOM'ing and restarting, only to find its connection pools immediately saturated again.
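To see why rapid connection churn runs out of ports so quickly, a back-of-the-envelope calculation helps. The sketch below assumes stock Linux defaults (an ephemeral port range of 32768 to 60999 and a 60-second TIME_WAIT); these values are assumptions for illustration and are not taken from the incident report. The helper name sustainableConnRate is hypothetical.

```go
package main

import "fmt"

// sustainableConnRate returns the approximate number of new connections per
// second that one (source IP, destination IP, destination port) combination
// can sustain when every closed connection holds its port in TIME_WAIT.
func sustainableConnRate(portLow, portHigh, timeWaitSeconds int) float64 {
	ports := portHigh - portLow + 1
	return float64(ports) / float64(timeWaitSeconds)
}

func main() {
	// Assumed Linux defaults, not values from the incident report:
	// net.ipv4.ip_local_port_range = 32768..60999, TIME_WAIT = 60 s.
	rate := sustainableConnRate(32768, 60999, 60)
	fmt.Printf("sustainable new connections/sec: %.0f\n", rate)

	// Worst case implied by the article: about 3 requests/sec with 20,000
	// URIs each, if every lookup opened its own short-lived connection.
	demand := 3 * 20000
	fmt.Printf("worst-case demand/sec: %d\n", demand)
}
```

Under these assumptions a single client-to-memcached address pair can sustain only a few hundred new connections per second, orders of magnitude below what unbounded batches of this size could demand.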
The root cause was a single missing line of code in an internal Go library. Specifically:
- Missing Concurrency Guard: The GetPostRecord endpoint was the only RPC handler in the system missing an errgroup.SetLimit() call.
- Logging Overhead: The system used blocking write(2) syscalls for logging. At a scale of millions of errors per second, this caused a thread-count explosion that overwhelmed the Go runtime.
- Aggressive Tuning: Environment variables for GOGC and GOMEMLIMIT were tuned so tightly that they provided no buffer for the sudden spike in OS threads and memory pressure.
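The missing guard can be illustrated with a minimal sketch of bounded fan-out. The article names errgroup.SetLimit() as the fix; to stay dependency-free, this example swaps in a buffered-channel semaphore from the standard library, which enforces the same kind of cap. The function names, the URI shape, and the limit of 64 are all hypothetical and not Bluesky's actual code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fetchAll calls fetch once per URI while keeping at most limit calls in
// flight. The buffered channel acts as a counting semaphore.
// limit must be at least 1.
func fetchAll(uris []string, limit int, fetch func(uri string)) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, uri := range uris {
		sem <- struct{}{} // blocks while limit goroutines are already running
		wg.Add(1)
		go func(uri string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this lookup ends
			fetch(uri)
		}(uri)
	}
	wg.Wait()
}

// countingRun drives fetchAll with n placeholder URIs and a fake lookup,
// returning how many lookups ran and the peak number in flight at once.
func countingRun(n, limit int) (calls, peak int64) {
	var inFlight, maxSeen, total atomic.Int64
	uris := make([]string, n)
	for i := range uris {
		uris[i] = fmt.Sprintf("at://example.invalid/post/%d", i) // hypothetical URI shape
	}
	fetchAll(uris, limit, func(string) {
		cur := inFlight.Add(1)
		for { // record the highest concurrency observed
			old := maxSeen.Load()
			if cur <= old || maxSeen.CompareAndSwap(old, cur) {
				break
			}
		}
		total.Add(1)
		inFlight.Add(-1)
	})
	return total.Load(), maxSeen.Load()
}

func main() {
	calls, peak := countingRun(20000, 64)
	fmt.Printf("lookups=%d peak in flight=%d (limit 64)\n", calls, peak)
}
```

With the cap in place, a 20,000-URI batch still completes, but it never holds more than 64 lookups (and therefore connections) open at once, no matter how large the batch grows.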
Timeline
- April 3, 22:16 UTC: First "address already in use" errors recorded in backend logs.
- April 4 (Saturday): Initial alert/page received. Engineers suspected a network transit issue.
- April 5 (Sunday): Service continued to experience intermittent "dips" while troubleshooting continued.
- April 6 (Monday): Incident escalated to SEV-1; roughly 50% of users experienced intermittent downtime over a period of about 8 hours.
- April 6, 23:00 UTC: Emergency "band-aid" fix deployed (loopback IP rotation); service stabilized.
- April 8 (Wednesday): Root cause identified and permanent concurrency limit patched.
Time to Detect (TTD): 2 hours (from the first major Saturday dip)
Time to Resolve (TTR): 50 hours (to full stabilization)
Who was affected?
The outage primarily impacted users served by a specific data center where the new internal service was active.
- User Impact: Approximately 50% of the total Bluesky user base experienced intermittent connectivity and feed loading failures.
- Service Impact: The AppView data plane suffered recurring OOM crashes and extreme latency during GC pauses.
- Developer Impact: Internal teams were unable to rely on the GetPostRecord RPC for cross-service data aggregation.
How did Bluesky respond?
Bluesky responded by implementing a highly unconventional "band-aid" to break the death spiral. Engineers modified the memcached client to use a custom dialer that picked a random loopback IP address (within the 127.0.0.0/8 range) for every connection. Because a TCP connection is identified by its addresses as well as its ports, varying the loopback IP multiplies the number of distinct connections available. This expanded the ephemeral port space from ~65,000 to millions, allowing the service to bypass the TIME_WAIT bottleneck.
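A rough sketch of what such a dialer could look like in Go follows. It assumes the random loopback address is used as the connection's local (source) address, which the article does not spell out, and it relies on Linux treating all of 127.0.0.0/8 as local; other operating systems may only accept 127.0.0.1. The function names and the memcached address in the comment are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
	"time"
)

// randomLoopbackIP returns a random address in 127.0.0.0/8. The last octet
// is kept in 1..254 to avoid network-style and broadcast-style addresses.
func randomLoopbackIP() net.IP {
	return net.IPv4(127, byte(rand.Intn(256)), byte(rand.Intn(256)), byte(1+rand.Intn(254)))
}

// newLoopbackDialer builds a dialer whose source address is a fresh random
// loopback IP. The port is left at 0 so the kernel picks an ephemeral port.
// Assumption: randomizing the local address, not the remote one.
func newLoopbackDialer() *net.Dialer {
	return &net.Dialer{
		Timeout:   time.Second,
		LocalAddr: &net.TCPAddr{IP: randomLoopbackIP()},
	}
}

func main() {
	d := newLoopbackDialer()
	fmt.Println("would dial from", d.LocalAddr)
	// A real client would create a new dialer per connection, for example:
	// conn, err := newLoopbackDialer().Dial("tcp", "127.0.0.1:11211") // hypothetical memcached address
}
```

Each new source IP gets its own set of ephemeral ports toward the same destination, so sockets stuck in TIME_WAIT on one address no longer block connections from another.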
Following the incident, Bluesky committed to:
- Implementing mandatory errgroup limits across all RPC handlers.
- Transitioning from blocking logging to Prometheus-based metrics and OTEL tracing for high-scale errors.
- Improving per-client observability to identify "heavy" internal requests instantly.
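The shift from per-error log lines to metrics can be sketched as follows. The errorCounter type here is a standard-library stand-in for a real metrics counter such as a Prometheus counter, and every name in the example is hypothetical. The point it illustrates is that recording an error becomes an atomic increment rather than a blocking write, with one summary emitted per interval.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// errorCounter is a stand-in for a metrics counter: recording an error is a
// single atomic add, with no syscall on the hot path.
type errorCounter struct{ n atomic.Int64 }

// Inc records one error.
func (c *errorCounter) Inc() { c.n.Add(1) }

// Flush returns the number of errors since the last flush and resets it.
func (c *errorCounter) Flush() int64 { return c.n.Swap(0) }

// simulateErrors records perWorker errors from each of workers goroutines
// and returns the single aggregated count a periodic reporter would see.
func simulateErrors(workers, perWorker int) int64 {
	var c errorCounter
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.Inc() // hot path: no log line, no write(2)
			}
		}()
	}
	wg.Wait()
	return c.Flush()
}

func main() {
	total := simulateErrors(100, 10000)
	// One summary line per flush interval instead of one line per error.
	fmt.Printf("dial errors in last interval: %d\n", total)
}
```

A million failures then cost a million cheap increments and a single output line, which keeps the error path from becoming its own source of load.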
How did Bluesky communicate?
Communication was handled via the official status page and a detailed technical blog post by Jim Calabro. While the team initially misidentified the issue as a third-party provider fault because of misleading traceroute data, they quickly corrected the record and provided a transparent breakdown of the internal coding error.
Key learnings for other teams
- Bound your concurrency: Never assume batch sizes will remain small; always set a hard limit on goroutine creation for network-bound tasks.
- Audit your logging: High-frequency logging in error paths can become a performance bottleneck that crashes your runtime.
- Expand your metrics: Implement per-client observability to distinguish between general traffic spikes and single-source resource exhaustion.
- Beware of TIME_WAIT: In high-throughput environments, the standard ephemeral port range can be a silent killer if connections are not recycled properly.
Quick Summary
Bluesky experienced a SEV-1 outage caused by a missing concurrency limit in a batch request handler. This triggered a TCP port shortage and a logging "death spiral" that paralyzed the Go runtime. The incident was resolved through an emergency loopback IP randomization hack and a subsequent code patch.
How ilert can help
During a complex "death spiral" incident, every minute of investigation counts. ilert helps teams minimize downtime by:
- Advanced Alerting: Instantly paging the right backend engineers via voice or SMS when traffic dips occur, bypassing silent dashboards.
- Incident Communication: Providing transparent, real-time Status Pages to keep users informed and reduce support ticket volume.
- Observability Integration: Linking your metrics directly to on-call schedules so that "port exhaustion" alerts reach the right expert immediately.

