PagerDuty: Customers experienced delayed notifications
A Kafka producer storm caused two PagerDuty outages on Aug 28, 2025. Here’s what failed, what the impact was, the timeline, the fixes, and how to harden your own incident response.
Company and product
PagerDuty is a cloud-based incident management and operations platform used by engineering, SRE, and on-call teams to detect, triage, and resolve production issues. Its core value lies in real-time event ingestion, intelligent routing, and multichannel alert delivery, integrated with monitoring, CI/CD, and collaboration tools. The platform orchestrates on-call schedules, escalations, runbooks, and stakeholder communications to reduce MTTA/MTTR at scale for enterprise customers. PagerDuty is a direct competitor to ilert, which makes learning from this incident especially important for us and our community.
What happened
A new API logging feature had a bug that created a brand-new Kafka producer for every API request instead of reusing one. That flooded Kafka with millions of producer connections per hour, overwhelmed broker memory, and destabilized the cluster, which then caused knock-on failures in services that depend on it.
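To make that failure mode concrete, here is a minimal sketch of the two patterns. It is illustrative only, not PagerDuty’s actual code: it assumes a Python service built with Flask and kafka-python, a broker at localhost:9092, and a hypothetical “api-audit-log” topic.

```python
# Illustrative sketch only, not PagerDuty's code. Assumes a Python service using
# Flask and kafka-python, a broker at localhost:9092, and an "api-audit-log" topic.
import json

from flask import Flask, request
from kafka import KafkaProducer

app = Flask(__name__)

# ANTI-PATTERN (the bug class described above): a brand-new producer per request.
# Each call opens fresh broker connections and registers new producer metadata,
# so request volume translates directly into producer count on the cluster.
@app.route("/v1/events-buggy", methods=["POST"])
def ingest_buggy():
    producer = KafkaProducer(bootstrap_servers="localhost:9092")  # new producer every call
    producer.send("api-audit-log", json.dumps(request.get_json()).encode("utf-8"))
    return "", 202

# FIX: one long-lived producer per process, created once at startup and reused.
# kafka-python's KafkaProducer is thread-safe, so sharing it across requests is
# the intended usage; connections and producer metadata are amortized over the
# lifetime of the process instead of growing with traffic.
shared_producer = KafkaProducer(bootstrap_servers="localhost:9092")

@app.route("/v1/events", methods=["POST"])
def ingest():
    shared_producer.send("api-audit-log", json.dumps(request.get_json()).encode("utf-8"))
    return "", 202
```

The fix is structural rather than clever: the producer moves out of the request path, so connection and metadata load on the brokers stays roughly constant no matter how much API traffic arrives.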
Result: Customers saw delayed notifications plus intermittent API errors (timeouts and 5xx). Webhooks and chat integrations such as Slack and Microsoft Teams experienced lag, and some notifications were duplicated during recovery. Incident creation and updates slowed, escalations took longer, and status updates were occasionally delayed. No accepted data was lost, but backlogs increased MTTA and MTTR, with the biggest impact in the US service regions.
Timeline
Aug 28, 2025, 03:53 UTC (first incident starts) → 10:10 UTC (fully resolved)
TTD: Not published.
TTR: 6h 17m
Aug 28, 2025, 16:38 UTC (second incident starts) → ~17:28 UTC (impact materially mitigated) → 20:24 UTC (fully resolved)
TTD: Not published.
TTR: 3h 46m (16:38 → 20:24). The team reduced customer impact within ~50 minutes, then fully restored service by 20:24.
Who was affected?
Customers in the US service regions experienced degraded event ingestion and API errors; outbound notifications, webhooks, chat integrations (Slack and Microsoft Teams), and the REST API were intermittently impacted. No previously accepted data was lost; however, backlogs caused delayed timelines and some duplicate webhooks during catch-up.
How the company responded
To stabilize the system, the team focused on Kafka first. They added capacity, took one problematic broker out of service, and increased the memory available to the remaining brokers. These changes were applied with rolling restarts so the cluster never went fully offline. The extra headroom reduced memory pressure and let Kafka handle traffic again while the team continued the investigation.
Next, they investigated the abnormal surge in Kafka producer activity. When a smaller recurrence appeared later that day, the signal was clear: a recently introduced API usage/auditing feature was spinning up a new producer for every request. The team quickly disabled and then rolled back that feature, which eliminated the producer storm at its source and stabilized producer metadata load across the cluster.
With the root cause removed, they focused on recovery. Backlogs in queues and downstream services were drained in a controlled manner to prevent secondary spikes. Systems dependent on Kafka (APIs, webhooks, and chat integrations) were verified as healthy, and duplicate notifications generated during catch-up were contained. Only after end-to-end checks and normal latencies were restored did the team declare the issue fully resolved.
How the company communicated
A publishing dependency initially blocked automated status page updates. The PagerDuty team executed a manual update path, later removed the dependency, republished the manual runbook, and committed to more frequent live drills for status communications.
Key learnings for others
- Guardrails for producer/client lifecycles. Use one long-lived Kafka producer per service instance and make that a hard rule in code. Add simple checks in CI that fail the build if a new producer is created per request or in other hot paths (a minimal CI-guard sketch follows this list). Add a small “early-warning” check in staging and at the edge of production that watches producer counts after each deploy; if they spike, stop the rollout automatically.
- Watch the right capacity signals. Treat broker metadata health and free memory as top-level SLOs, not nice-to-have graphs. Track active producer count, broker heap usage, controller queue depth, and producer-ID expiration. Alert on sudden jumps tied to a release; those step changes are often the first sign something is wrong, even before error rates climb (see the step-change sketch after this list).
- Progressive delivery with auto-rollback. Tie feature flags to objective guardrails: API 5xx rate, enqueue latency, broker heap headroom, and backlog growth. Ramp traffic gradually (e.g., 1% → 5% → 25% → 50%) and let automation roll back the flag the moment any guardrail is breached (see the rollout sketch after this list). This removes the human-in-the-loop delay when seconds matter.
- Blast-radius isolation. Design ingestion so one misbehaving feature cannot take the whole pipeline down. Use separate Kafka clusters or namespaces for experimental/low-trust features, apply backpressure at service boundaries, and add circuit breakers that trip before the shared bus is overwhelmed. Prefer asynchronous retries with jitter to avoid synchronized storms during recovery (see the jittered-retry sketch after this list).
- Cascading-failure playbooks. Pre-write the moves you’ll need under pressure: how to raise broker heap safely, rotate or drain a suspect broker, increase partitions, or pause specific producers. Keep one-click runbooks for rolling restarts of Kafka-dependent services and practice them in game days so the choreography is muscle memory, not improvisation.
- Outage communications resilience. Make the status page publish even when upstream systems are shaky. Remove fragile dependencies from the publishing path, keep a verified manual posting method, and rehearse it quarterly. Pre-draft external templates (“degradation,” “delays,” “duplicate notifications”) so updates go out quickly and consistently while engineering stabilizes the platform.
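For the producer-lifecycle guardrail, here is a hedged sketch of a CI check that fails the build when a producer is constructed inside a function body (i.e., potentially per request) instead of once at module level. It is a crude AST heuristic; the `src/` layout and the constructor names (kafka-python’s KafkaProducer, confluent-kafka’s Producer) are assumptions about your codebase.

```python
# Hedged sketch of a CI guard: fail the build if a Kafka producer is constructed
# inside a function body rather than once at module level. A crude AST heuristic;
# the src/ layout and constructor names are assumptions about your codebase.
import ast
import pathlib
import sys

PRODUCER_NAMES = {"KafkaProducer", "Producer"}  # kafka-python / confluent-kafka

def producer_calls_inside_functions(source: str) -> list[int]:
    """Line numbers where a producer constructor is called inside a def/async def."""
    offending = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for inner in ast.walk(node):
                if (isinstance(inner, ast.Call)
                        and isinstance(inner.func, ast.Name)
                        and inner.func.id in PRODUCER_NAMES):
                    offending.append(inner.lineno)
    return offending

if __name__ == "__main__":
    failures = []
    for path in pathlib.Path("src").rglob("*.py"):
        failures += [f"{path}:{line}"
                     for line in producer_calls_inside_functions(path.read_text())]
    if failures:
        print("Kafka producer constructed in a hot path:")
        print("\n".join(f"  {f}" for f in failures))
        sys.exit(1)
```

It will not catch every variant (for example, producers built through a factory or referenced as an attribute), but as a cheap build gate it targets exactly the class of mistake behind this outage.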
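For the capacity-signal learning, a small sketch of a release-correlated step-change check. The ratio, delta, and sample values are illustrative assumptions; in practice the samples would be pulled from whatever backend stores your broker metrics (active producer count, heap usage, and so on).

```python
# Hedged sketch: flag a release-correlated step change in a broker-side signal such
# as active producer count or heap usage. Thresholds and samples are illustrative.
from statistics import mean
from typing import Sequence

def step_change(pre_deploy: Sequence[float], post_deploy: Sequence[float],
                ratio: float = 2.0, min_delta: float = 1000.0) -> bool:
    """True if the post-deploy average jumps well above the pre-deploy baseline."""
    before, after = mean(pre_deploy), mean(post_deploy)
    return after > before * ratio and (after - before) > min_delta

# Example: ~20k active producers before a deploy, hundreds of thousands after -> alert.
if __name__ == "__main__":
    baseline = [19800, 20100, 20050, 19950]
    post_release = [240000, 610000, 905000]
    print("producer-count step change:", step_change(baseline, post_release))
```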
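For the progressive-delivery learning, a hedged sketch of a flag ramp with automatic rollback. The guardrail names, limits, step percentages, and the two callables are assumptions to be wired to your own metrics backend and feature-flag provider.

```python
# Hedged sketch of progressive delivery with automatic rollback. Guardrail names,
# thresholds, rollout steps, and the two callables are illustrative assumptions.
import time
from typing import Callable, Dict, List

GUARDRAILS = {
    "api_5xx_rate": 0.02,             # max fraction of 5xx responses
    "enqueue_latency_p99_ms": 500,    # max p99 enqueue latency
    "broker_heap_used_ratio": 0.85,   # max fraction of broker heap in use
    "backlog_growth_per_min": 10000,  # max message backlog growth
}
ROLLOUT_STEPS = [0.01, 0.05, 0.25, 0.50, 1.00]  # 1% -> 5% -> 25% -> 50% -> 100%

def breached(metrics: Dict[str, float]) -> List[str]:
    """Names of guardrails whose current value exceeds its limit."""
    return [name for name, limit in GUARDRAILS.items() if metrics.get(name, 0.0) > limit]

def progressive_rollout(
    flag: str,
    fetch_metrics: Callable[[], Dict[str, float]],   # pulls current values from your backend
    set_rollout: Callable[[str, float], None],        # calls your feature-flag provider
    soak_seconds: int = 300,
    poll_seconds: int = 15,
) -> bool:
    """Ramp a feature flag step by step; roll back to 0% on any guardrail breach."""
    for fraction in ROLLOUT_STEPS:
        set_rollout(flag, fraction)
        deadline = time.monotonic() + soak_seconds
        while time.monotonic() < deadline:
            bad = breached(fetch_metrics())
            if bad:
                set_rollout(flag, 0.0)  # automated rollback, no human in the loop
                print(f"rolled back {flag} at {fraction:.0%}: breached {bad}")
                return False
            time.sleep(poll_seconds)
    return True
```

The point of the design is that the rollback decision is mechanical: the flag ramps only while every guardrail stays green, and the first breach reverses it without waiting for a human to correlate dashboards.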
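And for blast-radius isolation, a minimal sketch of capped exponential backoff with full jitter; the `operation` callable is a placeholder for whatever delivery or produce call you are retrying.

```python
# Hedged sketch: capped exponential backoff with full jitter, so recovering clients
# do not all retry at the same instant. The `operation` callable is a placeholder
# (e.g., a webhook delivery or a Kafka produce-and-flush).
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_jitter(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 10.0,
) -> T:
    """Run `operation`, retrying transient failures with randomized backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount in [0, capped exponential backoff].
            backoff = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0.0, backoff))
```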
Quick summary
The outage stemmed from a faulty feature that spawned a new Kafka producer on every API request, peaking at roughly 4.2 million extra producers per hour, which exhausted broker heap and triggered a cascading failure across the cluster.
Customers experienced rejected or delayed events, degraded REST endpoints, slowed webhooks and chat integrations, and duplicate notifications during catch-up, though no previously accepted data was lost. The team stabilized the platform by expanding broker heap and executing rolling restarts, then eliminated the recurrence by rolling back the offending feature. Early in the incident, automated status updates failed to publish, so the team switched to a manual path and later fixed the dependency to harden communications.
How ilert can help improve incident response
- Keep notifications safe by running them on a protected, standalone layer in ilert. If you are an incident response company using your own product for alerting, consider adding a final layer of safety with ilert.
- Multiple ways to reach people, plus an emergency hotline. Send every alert through several channels, such as app push, phone call, SMS, email, and chat tools, using more than one provider, and keep retrying until someone responds. Add an emergency hotline number that calls the on-call team directly if other paths aren’t working.
- Status communications that don’t depend on fragile links. Automate stakeholder and status page updates from ilert’s alert actions, but keep a verified manual publishing mode. This mirrors the PagerDuty lesson where a publishing dependency blocked early updates.
- Guardrails tied to deploys and feature flags. The root cause was a feature that exploded producer cardinality. Connect ilert to CI/CD and feature flags to watch deploy-time SLOs (enqueue latency, 5xx rate, backlog growth) and escalate or trigger rollback when thresholds are breached.
- Postmortems that drive completion. Generate a structured postmortem, assign corrective actions (e.g., producer-lifecycle linting, metadata SLOs, status comms hardening), and track them to closure so fixes don’t drift.
