Cloudflare: How a faulty cleanup task unintentionally withdrew Internet routes
This article examines Cloudflare’s February 2026 outage, in which a faulty internal cleanup task unintentionally withdrew customer BYOIP routes from the Internet, leaving affected services unreachable.
Company and product
Cloudflare is a global Internet infrastructure provider offering CDN, security, and networking services. Its BYOIP feature allows customers to advertise their own IP ranges through Cloudflare’s network. This functionality is critical for organizations that require control over IP ownership while leveraging Cloudflare’s global routing, security, and performance capabilities. Because Cloudflare operates at the network layer, issues affecting IP advertisement can directly impact reachability across the Internet.
What happened
The incident was triggered by a newly deployed automated cleanup task designed to remove unused BYOIP prefixes. Due to a bug in how the system queried the API, the cleanup process incorrectly identified all BYOIP prefixes as candidates for deletion. Instead of targeting a limited subset, the task began withdrawing prefixes at scale.
This led to the following chain of events:
- Customer IP prefixes were withdrawn from BGP
- Traffic was no longer routed to Cloudflare
- End users experienced connection failures and timeouts
- Some services, such as Cloudflare’s public resolver interface, returned errors
Affected users initially experienced BGP path hunting, where traffic attempts to find alternative routes before eventually failing.
The impact varied:
- Some customers could restore service manually via the dashboard
- Others required deeper recovery due to removed configurations
- A subset required full restoration of service bindings across the edge
The root cause was a bug in an internal cleanup task interacting with Cloudflare’s Addressing API.
Specifically:
- The task queried an API endpoint using a parameter (pending_delete)
- Due to improper handling of empty values, the API returned all prefixes instead of only those pending deletion
- The system interpreted this as an instruction to remove all BYOIP prefixes
This resulted in a large-scale unintended withdrawal of IP routes.
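Cloudflare has not published the code involved, but the described behavior maps onto a well-known empty-filter pitfall. The sketch below (in Python, with hypothetical names such as list_prefixes and cleanup_task) shows how an empty parameter value can silently disable a filter, and how stricter validation closes the gap:

```python
# Minimal sketch of the empty-filter failure mode. All names are
# hypothetical; Cloudflare's actual Addressing API code is not public.

def withdraw(prefix):
    print(f"withdrawing {prefix['cidr']}")  # placeholder for a real BGP withdrawal

def list_prefixes(prefixes, pending_delete=None):
    """Return prefixes, optionally filtered by their pending_delete flag."""
    # BUG: an empty string is falsy, so the filter silently disappears
    # and every prefix is returned, not just those pending deletion.
    if not pending_delete:
        return prefixes
    want = pending_delete == "true"
    return [p for p in prefixes if p["pending_delete"] is want]

def cleanup_task(prefixes):
    # The task intends to ask only for prefixes pending deletion, but a
    # serialization bug sends an empty value for the parameter.
    candidates = list_prefixes(prefixes, pending_delete="")
    for p in candidates:
        withdraw(p)  # withdraws *all* prefixes, not just stale ones

# Safer variant: reject ambiguous input instead of guessing.
def list_prefixes_safe(prefixes, pending_delete=None):
    if pending_delete is None:
        return prefixes  # filter explicitly omitted by the caller
    if pending_delete not in ("true", "false"):
        raise ValueError("pending_delete must be 'true' or 'false'")
    want = pending_delete == "true"
    return [p for p in prefixes if p["pending_delete"] is want]
```

The key design choice is distinguishing "filter omitted" from "filter empty": the safe variant treats an empty or malformed value as an error rather than defaulting to the widest possible result.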
Several contributing factors increased the impact:
- Lack of safeguards for large-scale deletion operations
- Insufficient test coverage for automated background tasks
- Tight coupling between configuration state and operational execution
This combination allowed a single faulty process to propagate changes directly to production infrastructure.
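The first factor is the most directly addressable. As a minimal sketch, assuming a hypothetical MAX_DELETE_FRACTION threshold and a withdraw callback, a bulk operation can simply refuse to proceed when its scope looks implausible:

```python
# Sketch of a scope guard for bulk deletions (hypothetical names and
# threshold). Any run that would touch an implausibly large share of
# production state aborts and escalates to a human instead.

MAX_DELETE_FRACTION = 0.05  # refuse to withdraw more than 5% in one run

def guarded_bulk_withdraw(candidates, total_prefix_count, withdraw):
    if total_prefix_count == 0:
        return
    if len(candidates) / total_prefix_count > MAX_DELETE_FRACTION:
        raise RuntimeError(
            f"Refusing to withdraw {len(candidates)} of {total_prefix_count} "
            "prefixes: scope exceeds safety threshold, manual review required"
        )
    for prefix in candidates:
        withdraw(prefix)
```

A guard like this would have turned the incident into a failed background job rather than a global route withdrawal.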
Timeline
- Feb 5, 2026: Faulty cleanup sub-task code merged into production codebase
- 17:46 UTC: Code deployed to production (Addressing API release)
- 17:56 UTC: Prefix withdrawals begin; impact starts
- 18:13 UTC: Cloudflare engaged after detecting failures on 1.1.1.1
- 18:18 UTC: Internal incident declared
- 18:21 UTC: Addressing API team engaged and investigation begins
- 18:46 UTC: Faulty process identified and disabled; prefix withdrawals stop
- 19:11 UTC: Restoration efforts begin for withdrawn and removed prefixes
- 19:19 UTC: Customers begin self-mitigation via dashboard (re-advertising prefixes)
- 19:44 UTC: Additional mitigation: database recovery for removed prefixes
- 20:30 UTC: Majority of withdrawn prefixes restored (partial recovery)
- 21:08 UTC: Global configuration rollout to restore remaining prefixes
- 23:03 UTC: Full restoration completed; incident resolved
Time to Detect (TTD): 17 minutes
Time to Resolve (TTR): 5 hours and 7 minutes (from start of impact)
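Both figures follow directly from the timestamps in the timeline above; a quick check in Python:

```python
# Recomputing the incident metrics from the timeline (times in UTC).
from datetime import datetime

def t(hhmm):  # parse an HH:MM timestamp from the timeline
    return datetime.strptime(hhmm, "%H:%M")

print(t("18:13") - t("17:56"))  # 0:17:00 -> Time to Detect
print(t("23:03") - t("17:56"))  # 5:07:00 -> Time to Resolve
```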
Who was affected?
The incident primarily impacted customers using Cloudflare’s Bring Your Own IP (BYOIP) service. The scope of the impact is quantified as follows:
- Prefix Withdrawal: Out of 4,306 total BYOIP prefixes advertised globally, 1,100 prefixes (approximately 25%) were unintentionally withdrawn from the Cloudflare network.
- Infrastructure Impact: Roughly 6,500 prefixes were observed at one monitored BGP peer during the event; the 1,100 withdrawn prefixes represented about 17% of Cloudflare's advertised reach to that neighbor.
- Customer Subsets:
- 800 prefixes were restored relatively early (by 20:20 UTC) via dashboard toggles or automated reverts.
- 300 prefixes suffered a more severe impact because their service configurations were completely removed from the edge, preventing self-remediation and requiring manual restoration by engineers.
- Service-Specific Failures:
- Magic Transit & Spectrum: Applications protected by these services were not advertised on the Internet, resulting in connection timeouts.
- Core CDN & Security: Traffic was not "attracted" to Cloudflare, meaning websites on these ranges were unreachable.
- Dedicated Egress: Users relying on BYOIP for outbound traffic (Gateway) were unable to reach destinations.
- 1.1.1.1 Information Site: While DNS resolution itself remained functional, the one.one.one.one website returned HTTP 403 "Edge IP Restricted" errors.
How did Cloudflare respond?
Cloudflare responded by:
- Identifying and disabling the faulty cleanup task
- Reverting prefix withdrawals
- Enabling customer self-recovery via dashboard actions
- Performing manual restoration for affected configurations
Following the incident, Cloudflare introduced several remediation measures:
- Improving API schema validation
- Introducing safer rollback mechanisms
- Enhancing monitoring for large-scale configuration changes
- Implementing circuit breakers to prevent excessive automated actions
These efforts were aligned with their “Code Orange: Fail Small” initiative, aimed at reducing blast radius and improving deployment safety.
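Cloudflare has not published its circuit-breaker design, but one common shape is a counter that trips after too many destructive actions in a rolling window. A minimal sketch, with hypothetical limits:

```python
import time

# Sketch of a circuit breaker for automated destructive actions
# (hypothetical limits). Once a task performs too many withdrawals in a
# rolling window, the breaker trips and a human must re-arm it.

class CircuitBreaker:
    def __init__(self, max_actions=50, window_seconds=300):
        self.max_actions = max_actions
        self.window = window_seconds
        self.timestamps = []
        self.tripped = False

    def allow(self):
        if self.tripped:
            return False
        now = time.monotonic()
        # keep only actions that happened within the rolling window
        self.timestamps = [t for t in self.timestamps if now - t < self.window]
        if len(self.timestamps) >= self.max_actions:
            self.tripped = True  # stays open until manually reset
            return False
        self.timestamps.append(now)
        return True

breaker = CircuitBreaker()

def withdraw_with_breaker(prefix):
    if not breaker.allow():
        raise RuntimeError("Circuit breaker open: halting withdrawals pending review")
    # ... perform the actual BGP withdrawal here ...
```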
How did Cloudflare communicate?
Cloudflare communicated transparently through a detailed public postmortem.
The report included:
- A full technical breakdown of the issue
- A precise timeline of events
- Clear explanation of system behavior
- Concrete remediation steps
The tone emphasized accountability and learning, reinforcing Cloudflare’s commitment to improving system reliability.
Key learnings for other teams
- Automated processes require strict safeguards. Background tasks that operate on production data must include strong validation, scope limitations, and safety checks before making large-scale changes.
- Small bugs can have global impact. In distributed infrastructure, a minor logic error or incorrect API query can quickly propagate and affect large portions of production systems.
- Blast radius must be controlled. High-impact operations such as deleting configurations or withdrawing routes should include circuit breakers, gradual rollouts, and automatic stop conditions.
- Separate desired state from operational state. When configuration state is tightly coupled with execution, recovery becomes harder because a mistaken change can directly affect live production behavior (see the sketch after this list).
- Test beyond user workflows. Testing should cover automated background jobs and system-initiated actions, not only customer-facing or manually triggered workflows.
- Rollback paths must be fast and reliable. Teams need clear recovery mechanisms for restoring deleted configurations, re-advertising routes, and reversing large-scale infrastructure changes quickly.
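On the fourth point, a plan/apply split is one way to keep desired state separate from execution. A minimal sketch, assuming hypothetical names and an arbitrary approval threshold:

```python
# Sketch of a plan/apply split (hypothetical names and threshold). The
# executor never acts directly on a raw API response: it diffs validated
# desired state against live state, and large plans need explicit approval.

def plan_changes(desired_prefixes, live_prefixes):
    desired, live = set(desired_prefixes), set(live_prefixes)
    return {"withdraw": live - desired, "advertise": desired - live}

def apply_plan(plan, approved=False):
    if len(plan["withdraw"]) > 10 and not approved:  # example threshold
        raise RuntimeError("Plan withdraws many prefixes; approval required")
    for prefix in sorted(plan["withdraw"]):
        print(f"withdraw {prefix}")   # placeholder for real route withdrawal
    for prefix in sorted(plan["advertise"]):
        print(f"advertise {prefix}")  # placeholder for real route advertisement
```

Because the plan is an inspectable artifact rather than an immediate side effect, an oversized withdrawal list surfaces for review before anything reaches production, and reverting means re-applying the last known-good desired state.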
Quick Summary
Cloudflare experienced a SEV-1 outage caused by a faulty cleanup task that unintentionally withdrew customer IP prefixes from the Internet. This led to widespread reachability issues for BYOIP customers. The incident highlights the risks of automated operations without sufficient safeguards and the importance of limiting blast radius in production systems.
How ilert can help
During a complex BGP or API outage, seconds count. ilert helps teams minimize downtime and respond effectively by enabling:
- Advanced Alerting: Ensure the right network engineers are paged immediately via voice, SMS, or push notifications, bypassing "email silos."
- Incident Communication: Keep customers informed with branded Status Pages that update automatically, reducing the load on support teams during a crisis.
- On-Call Scheduling: Manage global rotations so there is always a "Code Orange" expert ready to respond to high-severity incidents.

