Cloudflare: How a faulty cleanup task unintentionally withdrew Internet routes
This article examines Cloudflare’s February 2026 outage, in which a faulty internal cleanup task unintentionally withdrew customer BYOIP routes from the Internet, leaving affected services unreachable.
Company and product
Cloudflare is a global Internet infrastructure provider offering CDN, security, and networking services. Its BYOIP feature allows customers to advertise their own IP ranges through Cloudflare’s network. This functionality is critical for organizations that require control over IP ownership while leveraging Cloudflare’s global routing, security, and performance capabilities. Because Cloudflare operates at the network layer, issues affecting IP advertisement can directly impact reachability across the Internet.
What happened
The incident was triggered by a newly deployed automated cleanup task designed to remove unused BYOIP prefixes. Due to a bug in how the system queried the API, the cleanup process incorrectly identified all BYOIP prefixes as candidates for deletion. Instead of targeting a limited subset, the task began withdrawing prefixes at scale.
This led to the following chain of events:
- Customer IP prefixes were withdrawn from BGP
- Traffic was no longer routed to Cloudflare
- End users experienced connection failures and timeouts
- Some services, such as Cloudflare’s public resolver interface, returned errors
Affected users initially experienced BGP path hunting, where traffic attempts to find alternative routes before eventually failing.
The impact varied:
- Some customers could restore service manually via the dashboard
- Others required deeper recovery due to removed configurations
- A subset required full restoration of service bindings across the edge
The root cause was a bug in an internal cleanup task interacting with Cloudflare’s Addressing API.
Specifically:
- The task queried an API endpoint using a parameter (pending_delete)
- Due to improper handling of empty values, the API returned all prefixes instead of only those pending deletion
- The system interpreted this as an instruction to remove all BYOIP prefixes
This resulted in a large-scale unintended withdrawal of IP routes.
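Cloudflare has not published the code involved, but the described behavior maps onto a well-known empty-filter pitfall. The sketch below (in Python, with hypothetical names such as list_prefixes and cleanup_task) shows how an empty parameter value can silently disable a filter, and how stricter validation closes the gap:

```python
# Minimal sketch of the empty-filter failure mode. All names are
# hypothetical; Cloudflare's actual Addressing API code is not public.

def withdraw(prefix):
    print(f"withdrawing {prefix['cidr']}")  # placeholder for a real BGP withdrawal

def list_prefixes(prefixes, pending_delete=None):
    """Return prefixes, optionally filtered by their pending_delete flag."""
    # BUG: an empty string is falsy, so the filter silently disappears
    # and every prefix is returned, not just those pending deletion.
    if not pending_delete:
        return prefixes
    want = pending_delete == "true"
    return [p for p in prefixes if p["pending_delete"] is want]

def cleanup_task(prefixes):
    # The task intends to ask only for prefixes pending deletion, but a
    # serialization bug sends an empty value for the parameter.
    candidates = list_prefixes(prefixes, pending_delete="")
    for p in candidates:
        withdraw(p)  # withdraws *all* prefixes, not just stale ones

# Safer variant: reject ambiguous input instead of guessing.
def list_prefixes_safe(prefixes, pending_delete=None):
    if pending_delete is None:
        return prefixes  # filter explicitly omitted by the caller
    if pending_delete not in ("true", "false"):
        raise ValueError("pending_delete must be 'true' or 'false'")
    want = pending_delete == "true"
    return [p for p in prefixes if p["pending_delete"] is want]
```

The key design choice is distinguishing "filter omitted" from "filter empty": the safe variant treats an empty or malformed value as an error rather than defaulting to the widest possible result.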
Several contributing factors increased the impact:
- Lack of safeguards for large-scale deletion operations
- Insufficient test coverage for automated background tasks
- Tight coupling between configuration state and operational execution
This combination allowed a single faulty process to propagate changes directly to production infrastructure.
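The first factor is the most directly addressable. As a minimal sketch, assuming a hypothetical MAX_DELETE_FRACTION threshold and a withdraw callback, a bulk operation can simply refuse to proceed when its scope looks implausible:

```python
# Sketch of a scope guard for bulk deletions (hypothetical names and
# threshold). Any run that would touch an implausibly large share of
# production state aborts and escalates to a human instead.

MAX_DELETE_FRACTION = 0.05  # refuse to withdraw more than 5% in one run

def guarded_bulk_withdraw(candidates, total_prefix_count, withdraw):
    if total_prefix_count == 0:
        return
    if len(candidates) / total_prefix_count > MAX_DELETE_FRACTION:
        raise RuntimeError(
            f"Refusing to withdraw {len(candidates)} of {total_prefix_count} "
            "prefixes: scope exceeds safety threshold, manual review required"
        )
    for prefix in candidates:
        withdraw(prefix)
```

A guard like this would have turned the incident into a failed background job rather than a global route withdrawal.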
Timeline
- Feb 5, 2026: Faulty cleanup sub-task code merged into production codebase
- 17:46 UTC: Code deployed to production (Addressing API release)
- 17:56 UTC: Prefix withdrawals begin; impact starts
- 18:13 UTC: Cloudflare engaged after detecting failures on 1.1.1.1
- 18:18 UTC: Internal incident declared
- 18:21 UTC: Addressing API team engaged and investigation begins
- 18:46 UTC: Faulty process identified and disabled; prefix withdrawals stop
- 19:11 UTC: Restoration efforts begin for withdrawn and removed prefixes
- 19:19 UTC: Customers begin self-mitigation via dashboard (re-advertising prefixes)
- 19:44 UTC: Additional mitigation: database recovery for removed prefixes
- 20:30 UTC: Majority of withdrawn prefixes restored (partial recovery)
- 21:08 UTC: Global configuration rollout to restore remaining prefixes
- 23:03 UTC: Full restoration completed; incident resolved
Time to Detect (TTD): 17 minutes
Time to Resolve (TTR): 5 hours and 7 minutes (from start of impact)
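Both figures follow directly from the timestamps in the timeline above; a quick check in Python:

```python
# Recomputing the incident metrics from the timeline (times in UTC).
from datetime import datetime

def t(hhmm):  # parse an HH:MM timestamp from the timeline
    return datetime.strptime(hhmm, "%H:%M")

print(t("18:13") - t("17:56"))  # 0:17:00 -> Time to Detect
print(t("23:03") - t("17:56"))  # 5:07:00 -> Time to Resolve
```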
Who was affected?
The incident primarily impacted customers using Cloudflare’s Bring Your Own IP (BYOIP) service. The scope of the impact is quantified as follows:
- Prefix Withdrawal: Out of 4,306 total BYOIP prefixes advertised globally, 1,100 prefixes (approximately 25%) were unintentionally withdrawn from the Cloudflare network.
- Infrastructure Impact: Roughly 6,500 prefixes were observed at one monitored BGP peer during the event; the 1,100 withdrawn prefixes represented about 17% of Cloudflare's advertised reach to that neighbor.
- Customer Subsets:
- 800 prefixes were restored relatively early (by 20:20 UTC) via dashboard toggles or automated reverts.
- 300 prefixes suffered a more severe impact because their service configurations were completely removed from the edge, preventing self-remediation and requiring manual restoration by engineers.
- Service-Specific Failures:
- Magic Transit & Spectrum: Applications protected by these services were not advertised on the Internet, resulting in connection timeouts.
- Core CDN & Security: Traffic was not "attracted" to Cloudflare, meaning websites on these ranges were unreachable.
- Dedicated Egress: Users relying on BYOIP for outbound traffic (Gateway) were unable to reach destinations.
- 1.1.1.1 Information Site: While DNS resolution itself remained functional, the one.one.one.one website returned HTTP 403 "Edge IP Restricted" errors.
How did Cloudflare respond?
Cloudflare responded by:
- Identifying and disabling the faulty cleanup task
- Reverting prefix withdrawals
- Enabling customer self-recovery via dashboard actions
- Performing manual restoration for affected configurations
Following the incident, Cloudflare introduced several remediation measures:
- Improving API schema validation
- Introducing safer rollback mechanisms
- Enhancing monitoring for large-scale configuration changes
- Implementing circuit breakers to prevent excessive automated actions
These efforts were aligned with their “Code Orange: Fail Small” initiative, aimed at reducing blast radius and improving deployment safety.
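Cloudflare has not published its circuit-breaker design, but one common shape is a counter that trips after too many destructive actions in a rolling window. A minimal sketch, with hypothetical limits:

```python
import time

# Sketch of a circuit breaker for automated destructive actions
# (hypothetical limits). Once a task performs too many withdrawals in a
# rolling window, the breaker trips and a human must re-arm it.

class CircuitBreaker:
    def __init__(self, max_actions=50, window_seconds=300):
        self.max_actions = max_actions
        self.window = window_seconds
        self.timestamps = []
        self.tripped = False

    def allow(self):
        if self.tripped:
            return False
        now = time.monotonic()
        # keep only actions that happened within the rolling window
        self.timestamps = [t for t in self.timestamps if now - t < self.window]
        if len(self.timestamps) >= self.max_actions:
            self.tripped = True  # stays open until manually reset
            return False
        self.timestamps.append(now)
        return True

breaker = CircuitBreaker()

def withdraw_with_breaker(prefix):
    if not breaker.allow():
        raise RuntimeError("Circuit breaker open: halting withdrawals pending review")
    # ... perform the actual BGP withdrawal here ...
```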
How did Cloudflare communicate?
Cloudflare communicated transparently through a detailed public postmortem.
The report included:
- A full technical breakdown of the issue
- A precise timeline of events
- Clear explanation of system behavior
- Concrete remediation steps
The tone emphasized accountability and learning, reinforcing Cloudflare’s commitment to improving system reliability.
Key learnings for other teams
- Automated processes require strict safeguards. Background tasks that operate on production data must include strong validation, scope limitations, and safety checks before making large-scale changes.
- Small bugs can have global impact. In distributed infrastructure, a minor logic error or incorrect API query can quickly propagate and affect large portions of production systems.
- Blast radius must be controlled. High-impact operations such as deleting configurations or withdrawing routes should include circuit breakers, gradual rollouts, and automatic stop conditions.
- Separate desired state from operational state. When configuration state is tightly coupled with execution, recovery becomes harder because a mistaken change can directly affect live production behavior (see the sketch after this list).
- Test beyond user workflows. Testing should cover automated background jobs and system-initiated actions, not only customer-facing or manually triggered workflows.
- Rollback paths must be fast and reliable. Teams need clear recovery mechanisms for restoring deleted configurations, re-advertising routes, and reversing large-scale infrastructure changes quickly.
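On the fourth point, a plan/apply split is one way to keep desired state separate from execution. A minimal sketch, assuming hypothetical names and an arbitrary approval threshold:

```python
# Sketch of a plan/apply split (hypothetical names and threshold). The
# executor never acts directly on a raw API response: it diffs validated
# desired state against live state, and large plans need explicit approval.

def plan_changes(desired_prefixes, live_prefixes):
    desired, live = set(desired_prefixes), set(live_prefixes)
    return {"withdraw": live - desired, "advertise": desired - live}

def apply_plan(plan, approved=False):
    if len(plan["withdraw"]) > 10 and not approved:  # example threshold
        raise RuntimeError("Plan withdraws many prefixes; approval required")
    for prefix in sorted(plan["withdraw"]):
        print(f"withdraw {prefix}")   # placeholder for real route withdrawal
    for prefix in sorted(plan["advertise"]):
        print(f"advertise {prefix}")  # placeholder for real route advertisement
```

Because the plan is an inspectable artifact rather than an immediate side effect, an oversized withdrawal list surfaces for review before anything reaches production, and reverting means re-applying the last known-good desired state.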
Quick Summary
Cloudflare experienced a SEV-1 outage caused by a faulty cleanup task that unintentionally withdrew customer IP prefixes from the Internet. This led to widespread reachability issues for BYOIP customers. The incident highlights the risks of automated operations without sufficient safeguards and the importance of limiting blast radius in production systems.
How ilert can help
During a complex BGP or API outage, seconds count. ilert helps teams minimize downtime and respond effectively by enabling:
- Advanced Alerting: Ensure the right network engineers are paged immediately via voice, SMS, or push notifications, bypassing "email silos."
- Incident Communication: Keep customers informed with branded Status Pages that update automatically, reducing the load on support teams during a crisis.
- On-Call Scheduling: Manage global rotations so there is always a "Code Orange" expert ready to respond to high-severity incidents.

