Postmortem Library

Cloudflare outage, June 2025: Third-party storage failure event

Cloudflare faced a major global outage caused by a third-party storage failure, impacting critical services and customers worldwide. Learn exactly what went wrong, how Cloudflare responded, what was done to remediate, and the reliability lessons that matter for future resilience.

Company

Cloudflare is a global leader in internet security, performance, and reliability services. Its platform protects and accelerates millions of websites, APIs, and internet applications by providing an integrated suite of tools, including a world-class Content Delivery Network (CDN), DNS resolution, DDoS mitigation, and Zero Trust security. Developers and enterprises alike rely on products like Workers and Workers KV for serverless computing, edge storage, and distributed application logic.

What happened during the Cloudflare outage?

Cloudflare faced a high-severity outage lasting 2 hours and 28 minutes, caused by a failure in the third-party storage backend powering Workers KV – a critical dependency for many core services.

The outage impacted global customers using Cloudflare’s Zero Trust, serverless, and edge platforms, disrupting services such as Access, WARP, Gateway, Turnstile, Stream, Workers AI, and parts of the Dashboard. Some services experienced near-total failure (e.g., 100% errors in Access logins and Stream Live), while others saw partial degradation (e.g., a 97% success rate in Images and elevated CDN latency). Although DNS, Magic WAN, Magic Transit, and WAF remained online, downstream effects were observed.
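
To make the dependency concrete: inside a Cloudflare Worker, Workers KV is read through a KV binding on the environment object. The sketch below is not Cloudflare's code; the binding name CONFIG_KV and the bundled default policy are hypothetical. It simply shows one way a KV read can be treated as a soft dependency, so that a storage-backend failure degrades behavior rather than returning hard errors.

```typescript
// Minimal sketch of a Worker that treats Workers KV as a soft dependency.
// KVNamespace comes from @cloudflare/workers-types; the binding name
// CONFIG_KV and the fallback policy below are hypothetical.
interface Env {
  CONFIG_KV: KVNamespace;
}

// Conservative default used only when KV cannot be reached.
const DEFAULT_POLICY = "deny-new-sessions-allow-existing";

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    let policy: string | null = null;
    try {
      // The KV read is the call that fails when the storage backend is down.
      policy = await env.CONFIG_KV.get("access-policy");
    } catch (err) {
      // Degrade gracefully to a bundled default instead of returning a 5xx.
      console.warn("Workers KV unavailable, using bundled default policy:", err);
    }
    return new Response(`policy: ${policy ?? DEFAULT_POLICY}`);
  },
};
```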

Timeline

When did the Cloudflare incident start?

The incident began at 17:52 UTC, when Cloudflare’s WARP team detected new device registration failures. Internal alerts fired quickly, and by 18:05 UTC, the Access team was paged due to spiking error rates. With SLOs breaching across services, the incident was declared a P1 by 18:06 UTC.
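
To put “SLOs breaching” in concrete terms, a paging decision like this is often driven by an error-budget burn rate. The sketch below is illustrative rather than Cloudflare's actual tooling; the 99.9% objective and the 14.4x burn-rate threshold are common defaults from SRE practice, not figures from the incident.

```typescript
// Hypothetical SLO breach check of the kind that pages a team when error
// rates spike. The objective and thresholds are illustrative only.
interface SliWindow {
  total: number;
  errors: number;
}

const SLO_TARGET = 0.999;             // 99.9% success objective
const ERROR_BUDGET = 1 - SLO_TARGET;  // 0.1% of requests may fail

// Page when the short-window error rate burns the budget many times faster
// than the SLO period allows (14.4x is a common fast-burn threshold).
function shouldPage(window: SliWindow, burnRateThreshold = 14.4): boolean {
  if (window.total === 0) return false;
  const errorRate = window.errors / window.total;
  return errorRate / ERROR_BUDGET >= burnRateThreshold;
}

// Example: 5% errors over the last few minutes burns budget 50x too fast.
console.log(shouldPage({ total: 10_000, errors: 500 })); // true -> page
```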

How was the Cloudflare incident detected and escalated?

As the global impact became clear, the incident was escalated to a P0 at 18:21 UTC. Mitigation began shortly after: by 18:43 UTC, Access was being decoupled from Workers KV, followed by graceful degradation in Gateway at 19:09 UTC and load shedding at 19:32 UTC.
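
Load shedding in this context generally means rejecting a slice of lower-priority traffic so that a struggling or recovering dependency is not overwhelmed. The sketch below is a generic illustration, not Cloudflare's implementation; the shed ratio and the x-priority header are invented for the example.

```typescript
// Hypothetical load-shedding middleware: reject a fraction of non-critical
// requests so the recovering backend sees less load. Ratio and header names
// are illustrative, not Cloudflare's.
const SHED_RATIO = 0.5; // shed 50% of low-priority traffic during recovery

async function handle(
  request: Request,
  upstream: (r: Request) => Promise<Response>,
): Promise<Response> {
  const critical = request.headers.get("x-priority") === "critical";
  if (!critical && Math.random() < SHED_RATIO) {
    // A 503 with Retry-After lets well-behaved clients back off instead of
    // hammering the backend with immediate retries.
    return new Response("shed during recovery", {
      status: 503,
      headers: { "Retry-After": "30" },
    });
  }
  return upstream(request);
}
```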

When was the Cloudflare outage resolved?

Recovery began at 20:23 UTC as the storage vendor came back online. Access and Device Posture resumed normal operation by 20:25 UTC, and the incident was fully resolved at 20:28 UTC.
MTTD: ~13 minutes
MTTR: ~2 hours 36 minutes
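
Those figures line up with the timeline above if MTTD is measured from the first failure signal (17:52 UTC) to the page at 18:05 UTC, and MTTR from that first signal to resolution at 20:28 UTC, as the quick check below shows.

```typescript
// Back-of-the-envelope check against the timestamps in the timeline.
const firstSignal = Date.parse("2025-06-12T17:52:00Z"); // WARP registration failures detected
const paged       = Date.parse("2025-06-12T18:05:00Z"); // Access team paged
const resolved    = Date.parse("2025-06-12T20:28:00Z"); // incident fully resolved

const minutes = (ms: number) => Math.round(ms / 60_000);

console.log(minutes(paged - firstSignal));    // 13  -> MTTD ~13 minutes
console.log(minutes(resolved - firstSignal)); // 156 -> MTTR ~2 hours 36 minutes
```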

How did Cloudflare respond to the outage?

Cloudflare’s engineering team escalated quickly and followed runbooks for critical systems. While mitigation was fast, detection relied on downstream failure signals, and some of the tooling needed for recovery was still under development. Future improvements, as outlined in the postmortem blog, aim to reduce response lag through better observability and fallback design. The team’s response reflected a blameless, learning-focused culture.

Who was affected by the Cloudflare outage, and how bad was it?

The outage disrupted a large portion of Cloudflare’s ecosystem, affecting both enterprise and end users worldwide. While DNS, Cache, and WAF remained stable, many Zero Trust, serverless, and developer products experienced full or near-total outages.

Most severely affected:

  • Access: 100% failure for all identity-based logins.
  • WARP: New client connections failed; policy checks broke for many users.
  • Stream Live & Workers AI: Up to 100% error rates.
  • Pages Builds, Zaraz, AutoRAG, Queues, Browser Rendering: Fully offline.
  • Realtime SFU & TURN: Traffic dropped to roughly 20% of normal levels.
  • Gateway, Dashboard, Turnstile: Heavily degraded in identity-related features.

Partial failures:

  • Cloudflare Images: 97% success rate, but batch uploads failed.
  • CDN: Mostly stable, with localized latency and routing issues in select cities.
  • Durable Objects & D1: Error rates peaked at 22% before recovery.

How did Cloudflare communicate during the outage?

Cloudflare maintained steady, transparent communication throughout the outage, primarily via its status page and a detailed postmortem published afterward. Updates were timely and adjusted as severity escalated. While communication flowed clearly across teams, users relying on indirect dependencies may not have had early insight into the root cause, and fail-closed, secure-by-design policies left some of them briefly disrupted with no fallback path.

As Google’s SRE playbook highlights, strong incident communication requires not just accuracy but also timing, tone, and empathy. Cloudflare delivered on most of these, but edge-case visibility could still be improved.

What patterns did the Cloudflare outage reveal?

The outage highlighted some recurring risks in distributed systems:

  • Single-vendor dependency. Relying on one provider for key infrastructure can lead to wide impact.
  • Fail-closed without fallback. Security-first systems need graceful degradation paths to avoid full lockout.
  • Incomplete migrations. Ongoing architecture shifts can expose gaps in failover readiness.
  • Cache recovery strain. Uncontrolled cache restarts may overload systems during recovery (see the sketch after this list).
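
The cache-recovery point is essentially a thundering-herd problem: if every cached entry expires or is repopulated at the same instant a backend comes back, the backend absorbs a synchronized wave of requests. One common mitigation is to add random jitter to entry lifetimes so refreshes spread out. The sketch below is a generic illustration (the in-memory map, TTL, and jitter window are arbitrary), not a description of Cloudflare's caches.

```typescript
// Sketch of jittered cache expiry to avoid a thundering herd when a backend
// comes back online. Cache structure, key names, and TTLs are hypothetical.
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

const cache = new Map<string, CacheEntry<string>>();

async function getWithJitter(
  key: string,
  fetcher: (k: string) => Promise<string>,
  ttlMs = 60_000,
): Promise<string> {
  const now = Date.now();
  const hit = cache.get(key);
  if (hit && hit.expiresAt > now) return hit.value;

  // Stagger refreshes: each entry gets a random extra 0-30 s of lifetime so
  // entries do not all expire (and re-fetch) at the same instant after recovery.
  const jitterMs = Math.random() * 30_000;
  const value = await fetcher(key);
  cache.set(key, { value, expiresAt: now + ttlMs + jitterMs });
  return value;
}
```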

Quick summary

On June 12, 2025, Cloudflare experienced a 2.5-hour global outage due to a failure in its third-party storage backend supporting Workers KV. The incident disrupted core services like Access, WARP, Workers AI, and Stream, with some products experiencing 100% failure rates. Cloudflare responded with fast escalation, mitigation steps, and transparent communication. The outage highlighted key resilience risks in cloud architecture, including single vendor dependency and fail-closed design.
