Cloudflare outage proves Plan B depends on controlling DNS

Tue 18 November 2025

Co-Founder

On Tuesday, 18 November 2025, Cloudflare’s own status page marked every major service—CDN, Firewall, WARP, Workers, and the dashboard—as degraded for most of the day while engineers worked through an internal control-plane failure. The timeline moved from “Investigating” at 11:48 UTC to “Monitoring” after 14:42 UTC, and the incident wasn’t officially resolved until 19:28 UTC. During the worst of it, Cloudflare disabled WARP in London, bot scores seesawed, and customers were told to wait while remediation continued.

Waiting was the only option for many teams because their Plan B lived behind the same dashboard that was timing out. The top comment on the Hacker News thread was a set of curl commands for moving domains off Cloudflare’s proxy edge. Admins were stuck in 2FA flows trying to fetch an API token, or searching for Terraform credentials so they could toggle a proxied flag. That is not a resilience strategy.

We learned this lesson the hard way and wrote about it after the 2021 Fastly outage in “How to have a Plan B”. The rule still stands: the platform you are trying to leave cannot be the only place that can change where your DNS points.

Detect: understand what’s actually broken

Incidents like Tuesday’s change shape quickly. Cloudflare’s own feed showed different failure domains every 30 minutes: bot management, dashboard auth, Access, WARP. The first mile is impartial telemetry that tells you what your users feel, not what the provider thinks. At Peakhour we stream real user monitoring, synthetic checks, and control-plane health from multiple CDNs and DNS partners. That lets us distinguish “cache errors in Hong Kong” from “global auth outage” and choose the right lever.
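To make the “cache errors in Hong Kong” versus “global auth outage” distinction concrete, here is a minimal sketch of how synthetic check results could be rolled up into a scope. It is an illustration, not our production pipeline: the `Probe` fields, the region and provider labels, and the 50% failure threshold are all assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One synthetic check result (hypothetical shape for this sketch)."""
    region: str    # where the check ran, e.g. "hkg"
    provider: str  # which edge served it, e.g. "cdn-a"
    ok: bool       # did the check pass?


def classify(probes: list[Probe], provider: str, threshold: float = 0.5) -> str:
    """Return 'unknown', 'healthy', 'global' or 'regional:<regions>'."""
    by_region: dict[str, list[bool]] = defaultdict(list)
    for probe in probes:
        if probe.provider == provider:
            by_region[probe.region].append(probe.ok)
    if not by_region:
        return "unknown"  # no telemetry for this provider at all

    # A region counts as failing when its failure rate reaches the threshold.
    failing = sorted(
        region
        for region, results in by_region.items()
        if results.count(False) / len(results) >= threshold
    )
    if not failing:
        return "healthy"
    if len(failing) == len(by_region):
        return "global"
    return "regional:" + ",".join(failing)
```

A “regional” result suggests steering only the affected geography, while “global” points to a full diversion. In practice you would also weigh real user monitoring, which this sketch leaves out.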

Decide: keep DNS authority in neutral territory

When your domain delegation lives with agnostic providers (Route 53, NS1, Azure DNS, or the enterprise registrar your legal team already approved), you can make failover decisions without pleading with a failing control plane. Peakhour doesn’t replace those vendors; we orchestrate them. We set TTLs short enough to fail over quickly but long enough that resolvers still cache usefully, keep secondary answers staged, and continuously audit API access so we can flip traffic with one signed request. The minute you outsource DNS authority to a proxy CDN, you have given up the control that makes Plan B possible.
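A quick way to see where you stand is to look at the nameservers your domain is delegated to and ask how many of them belong to the CDN you might need to leave. The sketch below assumes you already have the NS hostnames (for example from `dig NS example.com +short`); the function name and the sample hostnames are made up for illustration, and the suffix you pass in is whatever your own CDN uses for its nameservers.

```python
def delegation_exposure(nameservers: list[str], cdn_suffix: str) -> str:
    """Report how much of a zone's delegation sits on the CDN's nameservers.

    Returns "full" (every NS is on the CDN), "partial" or "none".
    """
    suffix = cdn_suffix.strip(".").lower()
    hosts = [ns.rstrip(".").lower() for ns in nameservers]
    if not hosts:
        raise ValueError("no nameservers supplied")

    # Match the suffix on a label boundary so "evilns.example" != "ns.example".
    on_cdn = [h for h in hosts if h == suffix or h.endswith("." + suffix)]
    if len(on_cdn) == len(hosts):
        return "full"
    return "partial" if on_cdn else "none"
```

“full” means the failing platform is the only place that can change your answers, which is exactly the situation described above. “none” is necessary for an independent Plan B but not sufficient: you still need working API credentials at the neutral provider.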

Divert: run the playbook in minutes, not hours

A workable Plan B has three moves:

Pre-stage alternate edges. Your secondary CDN, origin, or transit provider must be in sync with the active one—certificates, cache rules, WAF policies, everything. We keep them hot by replaying production configs across vendors.
Wire DNS automation. We integrate with multiple third-party DNS APIs at once so we can update apex A/AAAA, flattened CNAMEs, and geo/latency rules in a single workflow. Because the automation lives off the impacted platform, we can execute even while Cloudflare’s dashboard is returning 500s.
Drill humans on the handoff. Our SOC sits in Sydney and Melbourne, but we cover global hours. During an incident we line up Slack/Teams bridges with your SREs, confirm business impact, and keep execs in the loop while traffic drains to the healthy provider.
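The “wire DNS automation” move can be sketched as one workflow that pushes the same staged records to every DNS provider and reports per-provider results. This is a simplified illustration, not Peakhour’s tooling: `DnsProvider`, `InMemoryProvider`, `Record` and `divert` are hypothetical names, and a real adapter would wrap each vendor’s own API behind the same `upsert` method. The addresses come from the reserved documentation ranges.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class Record:
    name: str                # e.g. "example.com."
    rtype: str               # "A", "AAAA", "CNAME", ...
    values: tuple[str, ...]  # answers pointing at the standby edge
    ttl: int                 # seconds


class DnsProvider(Protocol):
    """Hypothetical adapter interface; one implementation per DNS vendor."""
    name: str

    def upsert(self, record: Record) -> None: ...


@dataclass
class InMemoryProvider:
    """Stand-in provider used for drills and tests in this sketch."""
    name: str
    zone: dict[tuple[str, str], Record] = field(default_factory=dict)

    def upsert(self, record: Record) -> None:
        self.zone[(record.name, record.rtype)] = record


def divert(providers: list[DnsProvider], staged: list[Record]) -> dict[str, str]:
    """Apply the staged records everywhere; never let one vendor block the rest."""
    outcome: dict[str, str] = {}
    for provider in providers:
        try:
            for record in staged:
                provider.upsert(record)
            outcome[provider.name] = "ok"
        except Exception as exc:  # record the failure and keep going
            outcome[provider.name] = f"failed: {exc}"
    return outcome


staged = [
    Record("example.com.", "A", ("203.0.113.10",), ttl=60),
    Record("example.com.", "AAAA", ("2001:db8::10",), ttl=60),
]
```

The important design choice is that this code runs somewhere that does not depend on the impacted platform, and that a failure at one DNS vendor is reported rather than aborting the whole diversion.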

With that in place we routinely hit sub-five-minute diversion times, including DNS propagation, because the decision, the tooling, and the people are ready before the outage hits.
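It helps to treat the five-minute target as a budget. As a rough model, the worst case is the time to detect, decide, and apply the change, plus one full TTL for resolvers that cached the old answer just before the switch. The numbers below are purely illustrative, not measurements, and the model ignores resolvers that hold answers longer than the TTL.

```python
def worst_case_diversion_seconds(detect: int, decide: int, apply: int, ttl: int) -> int:
    """Upper bound on time until TTL-respecting resolvers see the new answer."""
    return detect + decide + apply + ttl


def max_ttl_for_budget(budget: int, detect: int, decide: int, apply: int) -> int:
    """Largest TTL (seconds) that still fits inside the diversion budget."""
    return max(0, budget - (detect + decide + apply))


# Illustrative figures only: 60 s to detect, 60 s to decide, 30 s to apply.
total = worst_case_diversion_seconds(detect=60, decide=60, apply=30, ttl=60)
# 60 + 60 + 30 + 60 = 210 seconds, inside a 300-second budget
ceiling = max_ttl_for_budget(budget=300, detect=60, decide=60, apply=30)
# 300 - 150 = 150 seconds is the longest TTL that still fits
```

The same arithmetic shows why an hour-long TTL on the apex record rules out a five-minute diversion no matter how fast the humans and the tooling are.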

What Peakhour brings to your Plan B

Independent authority, familiar vendors. We leverage multiple established DNS providers instead of locking you into ours. You keep your contracts; we bring the automation and guardrails.
Unified multi-CDN config. Cache rules, image optimisation, WAF, and routing policies stay aligned across providers so you don’t lose capabilities when you switch.
Real drills, not just runbooks. Quarterly failover exercises prove that certificates, APIs, and humans are ready. We share the post-mortems so your execs see clear RTO/RPO numbers.
People you can phone. 24×7 Australian-based engineers who know your stack and can execute the play while your own team communicates with customers.
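Keeping configuration aligned across providers is easier to verify than to promise. A drift report between the primary and standby edge is one way to check it before a drill; the sketch below compares two flat settings dictionaries. The setting names and values are invented for the example, and real CDN configs are nested and vendor-specific, so they would need normalising first.

```python
from typing import Any


def config_drift(primary: dict[str, Any], standby: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (primary_value, standby_value)} for every mismatch.

    A setting missing on one side is reported with None for that side.
    """
    drift: dict[str, tuple[Any, Any]] = {}
    for key in sorted(primary.keys() | standby.keys()):
        left, right = primary.get(key), standby.get(key)
        if left != right:
            drift[key] = (left, right)
    return drift


# Hypothetical, already-normalised settings for two edges.
primary = {"waf_ruleset": "2025-11", "cache_ttl_html": 300, "image_optimisation": True}
standby = {"waf_ruleset": "2025-09", "cache_ttl_html": 300}

report = config_drift(primary, standby)
# report flags the stale WAF ruleset and the missing image optimisation setting
```

An empty report is the precondition for diverting without losing capabilities; anything else goes on the fix list before the next failover exercise.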

Book a resilience review

If Tuesday exposed that your failover path still depends on your primary provider’s dashboard, book a 30-minute Resilience Review with Peakhour and we’ll:

Map who really controls your DNS today.
Identify the gaps between your primary and standby CDNs.
Outline the automations we can layer on top of your existing DNS and hosting vendors.

The output is a concrete Plan B, a drill schedule, and a team that can execute it the next time a global provider blinks.

#CDN #DNS #Multi CDN #Incident Response