Cloudflare: Global outage caused by Bot Management config file
Cloudflare’s worst outage since 2019 halted core traffic on Nov 18, 2025. Here’s what broke, how they fixed it, and the lessons teams can apply.
Company and product
Cloudflare operates a global connectivity cloud that accelerates and secures internet traffic for millions of properties – CDN, DDoS protection, Zero Trust access, DNS, and developer services. Cloudflare’s footprint is hard to overstate. As of December 2025, around one-fifth of all websites (20.5%) run behind Cloudflare’s reverse proxy, giving it around 81% of the reverse-proxy market tracked by W3Techs. The company’s anycast network spans 300+ cities, interconnects with 12,000+ networks, and can reach up to 95% of the world’s population within around 50 ms. That distribution makes Cloudflare a de facto edge layer for a massive slice of the internet.
On ordinary days, Cloudflare handles tens of millions of HTTP requests per second and ingests hundreds of millions of events per second into its analytics pipelines, numbers that underscore how deeply its services sit on the hot path of global traffic. Beyond CDN and security, Cloudflare also operates the public 1.1.1.1 DNS resolver and publishes aggregated insights via Cloudflare Radar, reflecting the platform’s role as both delivery fabric and observability vantage point for the internet.
The net effect: when Cloudflare sneezes, the web feels it. Recent outages briefly impacted high-profile apps and sites worldwide, such as Uber, ChatGPT, and X, a reminder of the company’s systemic importance.
What happened
On November 18, 2025, at 11:20 UTC, Cloudflare’s network began failing to deliver core traffic after a permissions change on a ClickHouse cluster altered the results of the query that generates the Bot Management “feature file.” The change caused the query to return duplicate rows, doubling the file’s size and pushing it past a hard limit in the core proxy (FL/FL2).
As the oversized file propagated every five minutes, parts of the fleet alternated between “good” and “bad” states until the failure stabilized. Proxies on the newer FL2 path returned 5xx errors, while the older FL path did not return 5xxs but produced bot scores of zero, triggering false positives for customers that enforce bot-score rules.
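To make the mechanism concrete, here is a minimal Python sketch of that failure mode. All names and numbers in it are assumptions for illustration (the single metadata table, the 120 baseline features, the 200-entry limit), and the “second database becoming visible” is one plausible way a permissions change can produce duplicate rows; this is not Cloudflare’s actual query or proxy code.

```python
# Hypothetical illustration of the failure mode (assumed names and numbers,
# not Cloudflare's actual code): a metadata query that does not filter by
# database name starts returning one row per visible database after a
# permissions change, and the consumer's preallocated hard limit rejects
# the inflated artifact at load time.

FEATURE_LIMIT = 200  # assumed hard cap preallocated by the proxy


def build_feature_file(metadata_rows):
    # The generator trusts its input: every returned row becomes a feature entry.
    return [f"{row['table']}.{row['column']}" for row in metadata_rows]


def load_features(features):
    # The consumer fails hard once the artifact exceeds its preallocated limit.
    if len(features) > FEATURE_LIMIT:
        raise RuntimeError(
            f"feature file has {len(features)} entries, limit is {FEATURE_LIMIT}"
        )
    return features


# Before the change: columns are visible through a single database.
before = [{"db": "default", "table": "traffic", "column": f"f{i}"} for i in range(120)]

# After the change: the same columns are also visible through a second
# database, so the unfiltered query returns every row twice.
after = before + [{**row, "db": "replica"} for row in before]

load_features(build_feature_file(before))      # fine: 120 entries
try:
    load_features(build_feature_file(after))   # 240 entries exceed the limit
except RuntimeError as err:
    print(f"proxy would fail here: {err}")
```

The generator trusts whatever the query returns, so a change that silently multiplies rows flows straight into the artifact, and the consumer’s hard limit turns that into request-time failures once the file propagates.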
Impact rippled across products: core CDN and security traffic saw widespread 5xxs; Turnstile failed to load (blocking many dashboard logins); Workers KV and dependent systems, including Access, experienced elevated errors until a KV bypass reduced impact; and Email Security temporarily lost access to an IP reputation source, reducing spam-detection accuracy.
An unrelated, brief outage of Cloudflare’s off-platform status page initially reinforced an internal hypothesis of a hyper-scale DDoS attack. Responders eventually isolated the root cause to the malformed feature file, stopped its propagation, rolled back to a known-good artifact, and restarted the proxy, restoring core traffic by 14:30 UTC and fully normalizing service by 17:06 UTC.
Timeline
All timestamps are UTC, as reported by Cloudflare.
11:05 – Database access-control change deployed.
11:20 – Failures began; the first customer HTTP errors were observed at 11:28. Automated tests flagged issues at 11:31, manual investigation began at 11:32, and an incident bridge was opened at 11:35.
13:05 – Mitigations reduced the impact: Workers KV and Cloudflare Access were bypassed to a prior version of the core proxy.
14:24 – 14:30 – Generation and propagation of the bad Bot Management file were halted; a known-good file was validated and rolled out globally. Main impact resolved at 14:30.
17:06 – All downstream services fully restored; impact ended.
TTD (time to detect): ~3 minutes (11:28 → 11:31).
TTR (time to resolve): ~3 hours 2 minutes to main recovery (11:28 → 14:30); ~5 hours 38 minutes to full restoration (11:28 → 17:06).
The impact reached customers across Cloudflare’s network; the company called it its worst outage since 2019, with most core traffic unable to flow during the peak window.
How Cloudflare responded
Engineering escalated within minutes, opened an incident call, and pursued multiple workstreams in parallel: traffic manipulation and account limiting to protect stressed services, targeted bypasses to earlier proxy versions, and a rollback of the Bot Management configuration to a known-good artifact. Once the offending file was identified as the trigger, the team stopped its creation and propagation, pushed the corrected file, and restarted affected components. Teams then worked through the long tail, ensuring downstream systems recovered cleanly.
Communication during the outage
Cloudflare’s status page, hosted outside Cloudflare’s infrastructure, was briefly unreachable due to an unrelated issue, which initially reinforced the DDoS hypothesis inside the response team. After stability returned, Cloudflare published a detailed post-incident report the same day, providing exact UTC timestamps, the technical root cause, and commitments to hardening. The candid framing and explicit apology modeled transparent incident communication at internet scale.
Key learnings for other companies
Large, frequently updated configuration artifacts are de facto code and deserve the same guardrails:
- Schema and permission changes can subtly change query results, so feature generators must strictly filter inputs and validate outputs, including size, shape, and semantic checks, before distribution (see the sketch below).
- Global propagation loops should include rapid kill-switches and staged rollouts with blast-radius limits; alternating “good” and “bad” versions across a partially upgraded fleet can make symptoms look like an attack and slow triage.
- Keep out-of-band monitoring for your status site and ensure it is genuinely independent (hosted off-platform, on different providers, and continuously synthetic-tested) to avoid compounding confusion.
- Measure and rehearse recovery of core proxies under configuration-load failure.
- Treat every config pipeline like a deployment pipeline, with artifact registries, version pinning, and rollback-first design.

These are practical paths to shrinking both TTD and TTR.
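As a concrete illustration of the validation point above, here is a minimal sketch of a pre-publish gate for a generated configuration artifact. The limits, field names, and growth heuristic are assumptions chosen for the example, not a description of Cloudflare’s or any other vendor’s pipeline.

```python
# Hypothetical pre-publish gate for a generated config artifact: validate
# size, shape, and basic semantics before the artifact is allowed to
# propagate, and keep the previous known-good version for rollback.

import json

MAX_BYTES = 1_000_000    # assumed byte budget for the artifact
MAX_FEATURES = 200       # assumed hard limit of the downstream consumer
MAX_GROWTH_RATIO = 1.5   # assumed guard against sudden doubling


class ValidationError(Exception):
    pass


def validate_artifact(new_features, last_good_features):
    payload = json.dumps(new_features).encode()

    # Size checks: respect both the byte budget and the consumer's hard limit.
    if len(payload) > MAX_BYTES:
        raise ValidationError(f"artifact is {len(payload)} bytes, budget is {MAX_BYTES}")
    if len(new_features) > MAX_FEATURES:
        raise ValidationError(f"{len(new_features)} features exceed the limit of {MAX_FEATURES}")

    # Shape checks: entries must be unique and non-empty.
    if len(set(new_features)) != len(new_features):
        raise ValidationError("duplicate feature entries detected")
    if any(not f for f in new_features):
        raise ValidationError("empty feature entry detected")

    # Semantic check: reject implausible jumps relative to the last good version.
    if last_good_features and len(new_features) > MAX_GROWTH_RATIO * len(last_good_features):
        raise ValidationError(
            f"feature count jumped from {len(last_good_features)} to {len(new_features)}"
        )

    return new_features  # safe to publish; keep last_good_features for rollback


# Usage: gate the artifact before it enters the global propagation loop.
last_good = [f"traffic.f{i}" for i in range(120)]
candidate = last_good + last_good  # duplicated output, as in the incident
try:
    validate_artifact(candidate, last_good)
except ValidationError as err:
    print(f"blocked propagation: {err}")  # fall back to last_good instead
```

Run as a gate in the generation pipeline, a check like this turns a bad artifact into a blocked deploy and an alert rather than a fleet-wide failure, and keeping the last known-good version at hand makes rollback the default recovery path.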
Quick summary
A change to database permissions altered the results of a ClickHouse query used to build a Bot Management feature file. The file doubled in size, exceeded a proxy limit, and, once propagated, caused widespread 5xxs. Detection came within three minutes; main recovery in just over three hours; full recovery in under six. Cloudflare halted propagation, restored a known-good artifact, restarted services, and documented the event in depth, calling it its worst outage since 2019.
How ilert can help
ilert reduces the mean time it takes to notice, mobilize, fix, and inform.
- Faster detection: Unified alerting from synthetic checks and telemetry catches anomalies like rising 5xxs within minutes, then auto-routes to the on-call via voice, SMS, push, and chat.
- Smart escalation and collaboration: On-call schedules and escalation policies spin up the right responders and a shared timeline instantly. No guesswork when seconds matter.
- Change-aware response: Link alerts to deployment events so responders and AI agents can correlate symptoms with potential triggers immediately.
- Guardrails for comms: A hosted, independent ilert status page keeps customers informed even if your primary stack is degraded, with AI-assisted updates and stakeholder notifications.
- Postmortems that stick: Generate structured PIRs with timelines, action items, and ownership to make preventive controls such as validation, rollout gates, and kill-switches part of daily practice.
