Bluesky: Decoding the loopback death spiral and missing concurrency limits

This article examines the April 2026 Bluesky outage, where a missing concurrency limit in an internal RPC handler triggered port exhaustion, logging overload, and recurring AppView crashes. We explore how large batch requests led to a backend “death spiral,” how engineers stabilized the service, and what teams can learn about concurrency control, observability, and failure isolation.

Company and product

Bluesky is a decentralized social media platform powered by the AT Protocol. It operates an open ecosystem where users can maintain portable identities across different service providers. A core component of this infrastructure is the AppView, which aggregates global network data to serve user feeds and profiles.

To handle massive scale, the AppView utilizes a high-performance data plane that relies on ScyllaDB and a memcached layer. This caching strategy is designed to minimize database load, ensuring that millions of concurrent requests can be served with sub-millisecond latency.

What happened

The incident was triggered by a newly deployed internal service that began sending GetPostRecord requests containing batches of 15,000 to 20,000 URIs. While the request frequency was low (under 3 per second), the sheer size of the batches overwhelmed the system's ability to manage connections.

Because this RPC handler lacked a concurrency limit, it launched one goroutine per URI to query memcached, up to 20,000 simultaneously for a single request. This led to the following chain of events:

  • Thousands of memcached connections were opened and closed rapidly, exhausting available TCP ports.
  • Sockets became stuck in the TCP TIME_WAIT state, preventing new connections.
  • The system began logging these failures at a rate of millions per second.
  • The Go runtime spawned thousands of OS threads to handle blocking logging syscalls, leading to long garbage collection (GC) pauses and out-of-memory (OOM) crashes.

The service entered a cycle where it would function for approximately 30 minutes before OOM'ing and restarting, only to find its connection pools immediately saturated again.
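A back-of-the-envelope calculation helps show why such a low request rate could still exhaust ports. The sketch below rests on assumptions that are not from the incident report: a worst case where every URI lookup opens its own short-lived connection, the Linux default ephemeral port range of 32768 to 60999, and a 60-second TIME_WAIT period. Only the batch size and request rate come from the description above.

```go
package main

import "fmt"

// Assumed values for illustration only.
const (
	ephemeralLow  = 32768 // Linux default lower bound of ip_local_port_range
	ephemeralHigh = 60999 // Linux default upper bound of ip_local_port_range
	timeWaitSecs  = 60    // seconds a closed socket stays in TIME_WAIT on Linux
)

// ephemeralPorts returns how many source ports exist for one
// (source IP, destination IP, destination port) combination.
func ephemeralPorts(low, high int) int {
	return high - low + 1
}

// sustainableConnsPerSec is the highest rate of new short-lived connections
// that can be sustained before every port is parked in TIME_WAIT.
func sustainableConnsPerSec(ports, timeWaitSeconds int) float64 {
	return float64(ports) / float64(timeWaitSeconds)
}

func main() {
	ports := ephemeralPorts(ephemeralLow, ephemeralHigh)
	limit := sustainableConnsPerSec(ports, timeWaitSecs)

	// Worst case from the incident figures: 20,000 URIs per batch,
	// just under 3 batches per second, one fresh connection per URI.
	demand := 20000 * 3

	fmt.Printf("ports=%d sustainable=%.0f/s worst-case demand=%d/s\n",
		ports, limit, demand)
}
```

Under these assumptions the sustainable rate is a few hundred new connections per second, while the worst-case demand is two orders of magnitude higher. The real ratio depends on how much connection reuse the memcached client achieved.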

The root cause was a single missing line of code in an internal Go library. Two other factors amplified the damage:

  • Missing Concurrency Guard: The GetPostRecord endpoint was the only RPC handler in the system that did not call SetLimit on its errgroup.
  • Logging Overhead: The system used blocking write(2) syscalls for logging. At a scale of millions of errors per second, this caused a thread-count explosion that overwhelmed the Go runtime.
  • Aggressive Tuning: Environment variables for GOGC and GOMEMLIMIT were tuned so tightly that they provided no buffer for the sudden spike in OS threads and memory pressure.
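The missing guard can be illustrated with a minimal sketch. The incident report names errgroup's SetLimit as the mechanism; to stay dependency-free, this example substitutes a buffered-channel semaphore from the standard library, which has the same effect of capping in-flight goroutines. The function name `fetchAll`, the limit of 64, and the simulated lookup are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// fetchAll calls fetch once per URI but never runs more than limit calls
// at the same time. It returns the highest concurrency it observed.
func fetchAll(uris []string, limit int, fetch func(uri string)) int64 {
	if limit < 1 {
		limit = 1 // an unbuffered semaphore would deadlock the loop below
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var inFlight, peak int64

	for _, uri := range uris {
		sem <- struct{}{} // blocks once limit goroutines are running
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot after the work

			n := atomic.AddInt64(&inFlight, 1)
			for { // record the peak without taking a lock
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			fetch(u)
			atomic.AddInt64(&inFlight, -1)
		}(uri)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	uris := make([]string, 20000) // batch size from the incident description
	peak := fetchAll(uris, 64, func(string) {
		time.Sleep(100 * time.Microsecond) // stand-in for a cache lookup
	})
	fmt.Println("peak concurrent lookups:", peak)
}
```

Without the semaphore, the loop would start one goroutine per URI immediately. With it, the batch size only affects how long the request takes, not how many connections are open at once.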

Timeline

  • April 3, 22:16 UTC: First "address already in use" errors recorded in backend logs.
  • April 4 (Saturday): Initial alert/page received. Engineers suspected a network transit issue.
  • April 5 (Sunday): Service continued to experience intermittent "dips" while troubleshooting continued.
  • April 6 (Monday): Incident escalated to SEV-1; 50% of users faced intermittent downtime over an 8-hour period.
  • April 6, 23:00 UTC: Emergency "band-aid" fix deployed (loopback IP rotation); service stabilized.
  • April 8 (Wednesday): Root cause identified and permanent concurrency limit patched.

Time to Detect (TTD): 2 hours (from the first major Saturday dip)

Time to Resolve (TTR): 50 hours (to full stabilization)

Who was affected?

The outage primarily impacted users served by a specific data center where the new internal service was active.

  • User Impact: Approximately 50% of the total Bluesky user base experienced intermittent connectivity and feed loading failures.
  • Service Impact: The AppView data plane suffered recurring OOM crashes and extreme latency during GC pauses.
  • Developer Impact: Internal teams were unable to rely on the GetPostRecord RPC for cross-service data aggregation.

How did Bluesky respond?

Bluesky responded by implementing a highly unconventional "band-aid" to break the death spiral. Engineers modified the memcached client to use a custom dialer that picked a random loopback IP address (within the 127.0.0.0/8 range) for every connection. Because ephemeral ports are tracked per combination of source and destination address, this expanded the usable port space from ~65,000 to millions, allowing the service to bypass the TIME_WAIT bottleneck.
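The workaround can be sketched with Go's standard `net.Dialer`. This is an illustration, not Bluesky's actual patch: it assumes the random address is used as the connection's source (local) address, which the summary above does not specify, and it relies on Linux treating the whole 127.0.0.0/8 block as local. The helper names `randomLoopbackIP` and `newLoopbackDialer` are hypothetical, and a plain TCP listener stands in for memcached.

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
	"time"
)

// randomLoopbackIP returns an address in 127.0.0.0/8 whose last octet is
// between 1 and 254, so it never ends in .0 or .255.
func randomLoopbackIP() net.IP {
	return net.IPv4(127,
		byte(rand.Intn(256)),
		byte(rand.Intn(256)),
		byte(1+rand.Intn(254)))
}

// newLoopbackDialer builds a dialer bound to a fresh random loopback source
// address. Port 0 lets the kernel choose the ephemeral port.
func newLoopbackDialer() *net.Dialer {
	return &net.Dialer{
		Timeout:   time.Second,
		LocalAddr: &net.TCPAddr{IP: randomLoopbackIP(), Port: 0},
	}
}

func main() {
	// Stand-in for a cache server listening on loopback.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()

	for i := 0; i < 3; i++ {
		conn, err := newLoopbackDialer().Dial("tcp", ln.Addr().String())
		if err != nil {
			// Expected on systems that only configure 127.0.0.1.
			fmt.Println("dial failed:", err)
			continue
		}
		fmt.Println("connected from", conn.LocalAddr())
		conn.Close()
	}
}
```

Each distinct source address gets its own ephemeral port range toward the same destination, which is why spreading connections across the loopback block relieves TIME_WAIT pressure. It does not reduce the number of connections being opened, so it is a stopgap rather than a fix.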

Following the incident, Bluesky committed to:

  • Implementing mandatory errgroup limits across all RPC handlers.
  • Transitioning from blocking logging to Prometheus-based metrics and OTEL tracing for high-scale errors.
  • Improving per-client observability to identify "heavy" internal requests instantly.
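The second commitment, replacing per-error log lines with metrics, can be approximated in a few lines. The report names Prometheus and OTEL; the sketch below does not use either API. It is a standard-library stand-in that counts every error with an atomic increment and allows at most one log line per interval. The type `errorMeter` and its injectable clock are hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// errorMeter counts errors cheaply and rate-limits the log lines about them.
type errorMeter struct {
	total    int64         // errors recorded so far
	lastLog  int64         // clock reading when a log line was last allowed
	interval time.Duration // minimum gap between log lines
	now      func() int64  // clock in nanoseconds, injectable for tests
}

func newErrorMeter(interval time.Duration) *errorMeter {
	return &errorMeter{
		interval: interval,
		now:      func() int64 { return time.Now().UnixNano() },
	}
}

// Record counts one error and reports whether the caller may log now.
// It never blocks, so a burst of errors cannot stall calling goroutines.
func (m *errorMeter) Record() bool {
	atomic.AddInt64(&m.total, 1)
	now := m.now()
	last := atomic.LoadInt64(&m.lastLog)
	if now-last < int64(m.interval) {
		return false
	}
	// Only one goroutine wins the swap, so only one line is emitted.
	return atomic.CompareAndSwapInt64(&m.lastLog, last, now)
}

// Total returns the number of errors recorded.
func (m *errorMeter) Total() int64 { return atomic.LoadInt64(&m.total) }

func main() {
	m := newErrorMeter(100 * time.Millisecond)
	logged := 0
	for i := 0; i < 1000000; i++ { // simulated burst of dial failures
		if m.Record() {
			logged++
			fmt.Println("dial failures so far:", m.Total())
		}
	}
	fmt.Printf("errors=%d log lines=%d\n", m.Total(), logged)
}
```

The design choice is that the error path does a fixed amount of non-blocking work regardless of error volume. The full count stays available for a metrics scrape, while the log carries only a periodic summary.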

How did Bluesky communicate?

Communication was handled via the official status page and a detailed technical blog post by Jim Calabro. While the team initially misidentified the issue as a third-party provider fault due to misleading traceroute data, they quickly corrected the record and provided a transparent breakdown of the internal coding error.

Key learnings for other teams

  • Bound your concurrency: Never assume batch sizes will remain small; always set a hard limit on goroutine creation for network-bound tasks.
  • Audit your logging: High-frequency logging in error paths can become a performance bottleneck that crashes your runtime.
  • Expand your metrics: Implement per-client observability to distinguish between general traffic spikes and single-source resource exhaustion.
  • Beware of TIME_WAIT: In high-throughput environments, the standard ephemeral port range can be a silent killer if connections are not recycled properly.
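The per-client observability point can be made concrete with a small sketch. It assumes each internal caller identifies itself with a name on every request; the type `clientStats` and the client names are hypothetical. Tracking items per client, not just requests, is what separates a low-rate, huge-batch caller from ordinary traffic.

```go
package main

import (
	"fmt"
	"sync"
)

// clientStats aggregates request counts and batch sizes per calling service.
type clientStats struct {
	mu       sync.Mutex
	requests map[string]int // number of requests per client
	items    map[string]int // total items requested per client
	maxBatch map[string]int // largest single batch per client
}

func newClientStats() *clientStats {
	return &clientStats{
		requests: map[string]int{},
		items:    map[string]int{},
		maxBatch: map[string]int{},
	}
}

// Observe records one request of batchSize items from client.
func (s *clientStats) Observe(client string, batchSize int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.requests[client]++
	s.items[client] += batchSize
	if batchSize > s.maxBatch[client] {
		s.maxBatch[client] = batchSize
	}
}

// MaxBatch returns the largest batch seen from client.
func (s *clientStats) MaxBatch(client string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.maxBatch[client]
}

// Heaviest returns the client that requested the most items overall.
// Ties are broken by name so the result is deterministic.
func (s *clientStats) Heaviest() (string, int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	best, bestItems := "", -1
	for c, n := range s.items {
		if n > bestItems || (n == bestItems && c < best) {
			best, bestItems = c, n
		}
	}
	if bestItems < 0 {
		return "", 0
	}
	return best, bestItems
}

func main() {
	s := newClientStats()
	for i := 0; i < 1000; i++ {
		s.Observe("web-frontend", 25) // many small requests
	}
	for i := 0; i < 3; i++ {
		s.Observe("new-internal-service", 20000) // few huge batches
	}
	c, n := s.Heaviest()
	fmt.Printf("heaviest=%s items=%d maxBatch=%d\n", c, n, s.MaxBatch(c))
}
```

In this example the caller with only three requests is responsible for more than twice the items of the caller with a thousand, which a request-rate dashboard alone would not reveal.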

Quick Summary

Bluesky experienced a SEV-1 outage caused by a missing concurrency limit in a batch request handler. This triggered a TCP port shortage and a logging "death spiral" that paralyzed the Go runtime. The incident was resolved through an emergency loopback IP randomization hack and a subsequent code patch.

How ilert can help

During a complex "death spiral" incident, every minute of investigation counts. ilert helps teams minimize downtime by:

  • Advanced Alerting: Instantly paging the right backend engineers via voice or SMS when traffic dips occur, bypassing silent dashboards.
  • Incident Communication: Providing transparent, real-time Status Pages to keep users informed and reduce support ticket volume.
  • Observability Integration: Linking your metrics directly to on-call schedules so that "port exhaustion" alerts reach the right expert immediately.