You may have heard that Fastly, one of the world’s largest providers of CDN services, had an outage of about an hour on the 8th of June. Some of the world's largest websites and services went down, including Reddit, CNN, The Guardian, Shopify stores, Stripe and Spotify, to name a few.
According to Fastly themselves, the outage was caused by a 'service misconfiguration' (Update: a bug triggered by a client changing their configuration), which propagated globally and took websites offline. When users tried to access a website using the Fastly service, they were presented with a Varnish 503 Guru Meditation error (for those of us old enough to remember, Guru Meditation is a geek reference to the Commodore Amiga computer of the late 80s!). This error generally occurs when there is an issue contacting the server that the website is actually hosted on. There were also some reports on Twitter of an 'unknown domain' error.
Essentially, Fastly took down its own network with a bug in its own software, which lay dormant until a customer's configuration change triggered it. Similar problems have affected other online platforms in the recent past, including Google, Amazon, and Cloudflare.
Why wasn’t there a Plan B?
Fastly is an excellent service, with an enviable reliability record. There is a reason why they're trusted by some of the world's largest websites to improve reliability and load times. However, the vast majority of Fastly clients still had to sit tight and wait for Fastly to fix the issue. Luckily this was only an hour. It could have been much longer.
Just like death and taxes, software outages are a certainty. The real story is not that Fastly had an outage. It is that these large websites had no contingency plan for a single point of failure. For sites at that scale, this is a major oversight in infrastructure planning.
How to handle a CDN failure
The simple solution is to have a backup CDN provider already configured and tested, ready to switch over to if your primary provider fails. You can then utilise short expiry of DNS records to redirect users when the failure happens. This needn't be very expensive or complicated, although individual circumstances vary.
A Quick Introduction To DNS (Domain Name System)
Modern CDNs, like Fastly, Cloudflare, and Peakhour, operate as ‘reverse proxies’. This means they sit between a website's end users and the website server itself. They achieve this through DNS configuration.
When someone types a domain name into a browser, eg fastly.com, the browser asks a DNS server for the IP address that belongs to that host name, so it knows which server to retrieve the content from. CDNs, like Fastly, get website admins to list the address of the CDN in DNS instead of the address of the website's own server. That means requests for a website go through the CDN first. The process is analogous to listing someone else’s number in the phone book so they take calls for you.
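To make the phone book analogy concrete, here is a minimal sketch of a lookup against a toy set of records. Every host name and address in it is made up for illustration (the addresses come from ranges reserved for documentation), and a real resolver does far more than this.

```python
# Toy zone data: every name and address here is hypothetical.
RECORDS = {
    # The public name points at the CDN, not at the web server.
    "www.example.com": ("CNAME", "example.cdn-provider.test"),
    "example.cdn-provider.test": ("A", "203.0.113.10"),  # CDN edge address
    "origin.example.com": ("A", "198.51.100.7"),         # the real web server
}


def resolve(hostname, records=RECORDS, max_hops=8):
    """Follow CNAME records until an A record is found; return its IP address."""
    name = hostname
    for _ in range(max_hops):
        if name not in records:
            raise LookupError(f"no record for {name}")
        record_type, value = records[name]
        if record_type == "A":
            return value
        name = value  # CNAME: look up the target name next
    raise LookupError(f"too many CNAME hops resolving {hostname}")


print(resolve("www.example.com"))     # the CDN's address
print(resolve("origin.example.com"))  # the origin server's address
```

Visitors asking for www.example.com are handed the CDN's address, so the CDN takes the call and fetches from the origin on their behalf.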
Every DNS record has a TTL (Time To Live) associated with it. This TTL tells whoever asked for the IP address of a given hostname to remember the answer and not ask again until the TTL has passed. Typically DNS record TTLs will be 1 hour, but they can be shorter, eg 1 minute.
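The effect of a TTL can be sketched with a small cache that remembers each answer until it expires. This is an illustration only: the TtlCache class, the lookup function, and the addresses are hypothetical, and real resolvers are more involved.

```python
import time


class TtlCache:
    """Remember each DNS answer until its TTL has passed."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup  # hypothetical function: hostname -> (ip, ttl_seconds)
        self._clock = clock    # injectable so the example can control time
        self._entries = {}     # hostname -> (ip, expires_at)

    def resolve(self, hostname):
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh: do not ask again
        ip, ttl = self._lookup(hostname)
        self._entries[hostname] = (ip, now + ttl)
        return ip


# Simulated authoritative answers and a fake clock (both made up).
answers = {"www.example.com": ("203.0.113.10", 60)}
fake_now = [0.0]
cache = TtlCache(lambda host: answers[host], clock=lambda: fake_now[0])

first = cache.resolve("www.example.com")
answers["www.example.com"] = ("192.0.2.20", 60)  # the record is changed
fake_now[0] = 30.0
during_ttl = cache.resolve("www.example.com")    # old answer is still served
fake_now[0] = 61.0
after_ttl = cache.resolve("www.example.com")     # new answer is picked up
print(first, during_ttl, after_ttl)
```

The shorter the TTL, the sooner a changed record reaches everyone, at the cost of more lookups.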
Switching providers in case of an outage
By keeping a short TTL in DNS, webmasters can switch the answer for a DNS request to that of another provider, meaning users can quickly be directed to an alternative CDN provider. Once service has resumed on the primary provider, DNS can be switched back so normal traffic is resumed. The key is that the alternative provider is configured, tested, and ready to go.
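Some rough arithmetic shows why the TTL matters so much here. The sketch below assumes resolvers honour the TTL and that a visitor's resolver cached the old answer just before the switch; the function name and the 2 minute detection time are made up for illustration.

```python
def worst_case_stale_seconds(detection_seconds, ttl_seconds):
    """Upper bound on how long a visitor may keep being sent to the failed provider.

    A resolver that cached the old answer just before the DNS switch keeps it
    for one full TTL, on top of the time taken to notice the outage and switch.
    """
    if detection_seconds < 0 or ttl_seconds < 0:
        raise ValueError("durations must be non-negative")
    return detection_seconds + ttl_seconds


for ttl in (3600, 60):
    total = worst_case_stale_seconds(detection_seconds=120, ttl_seconds=ttl)
    print(f"TTL {ttl:>4}s: up to {total / 60:.0f} minutes on the failed provider")
```

With a 1 hour TTL the unluckiest visitors are stuck for up to 62 minutes, which is as long as the outage itself. With a 1 minute TTL the figure drops to 3 minutes.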
This switch can even be automated to minimise outages. Premium DNS services, like Amazon’s Route 53, offer optional health checking of DNS answers. This allows a switch to happen nearly instantly. The only downtime would be for people already on the site, who have to wait for the TTL to expire before being directed to the backup CDN provider. In fact this is exactly what Peakhour.io does. In the event of a catastrophic outage we use DNS to switch to backup infrastructure so our clients are minimally affected.
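The logic behind an automated switch can be sketched in a few lines. This is not the Route 53 API or Peakhour's implementation; the FailoverPolicy class, the threshold of three failed checks, and the host names are all assumptions made for illustration.

```python
class FailoverPolicy:
    """Decide which provider DNS should point at, based on health check results."""

    def __init__(self, primary, backup, failure_threshold=3):
        self.primary = primary
        self.backup = backup
        # Require several failures in a row so one dropped probe
        # does not trigger a switch (the value 3 is an assumption).
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def record_check(self, healthy):
        """Record one health check of the primary and return the target to serve."""
        if healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
        return self.current_target()

    def current_target(self):
        if self._consecutive_failures >= self.failure_threshold:
            return self.backup
        return self.primary


policy = FailoverPolicy("primary-cdn.example.net", "backup-cdn.example.net")
# Simulated probe results: the primary fails three times, then recovers.
for healthy in (True, False, False, False, True):
    print(healthy, "->", policy.record_check(healthy))
```

In practice the result of each check would come from a real HTTP probe against the primary provider, and a change of target would be pushed to your DNS provider. Switching back on the first healthy check is the simplest choice; a more cautious policy might wait for several successes.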
Backup provider options
Now that we've shown how switching CDN providers can be done, let's compare the major players and how they might serve as a backup CDN for Fastly. The three things we'll look at are cost, features, and integration.
Simply route traffic to the origin
This would be the simplest and most cost effective option, assuming your origin server can handle the increased load that removing its CDN would entail. It also assumes that it's ok to lose any features that you may have been relying on, eg load balancing, WAF, edge scripting, image optimisation etc.
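A back-of-envelope estimate helps judge whether the origin could cope. The sketch below assumes you know your cache hit ratio by request count and that every cache miss reaches the origin; the function name and the figures are made up for illustration.

```python
def origin_rps_without_cdn(origin_rps_with_cdn, cache_hit_ratio):
    """Estimate requests per second hitting the origin once the CDN is removed.

    With the CDN in place the origin only sees cache misses, so the full
    visitor traffic is the miss traffic divided by the miss ratio.
    """
    if not 0 <= cache_hit_ratio < 1:
        raise ValueError("cache_hit_ratio must be in the range [0, 1)")
    return origin_rps_with_cdn / (1 - cache_hit_ratio)


# Example: the origin normally serves 100 requests/s behind a CDN
# that answers 90% of requests from cache.
estimate = origin_rps_without_cdn(origin_rps_with_cdn=100, cache_hit_ratio=0.9)
print(f"Origin would need to serve roughly {estimate:.0f} requests/s")
```

In this example the origin would face roughly ten times its usual load, so it is worth load testing before relying on this option.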
Cloudflare
Many people use Fastly because it is built on Varnish, a richly featured, programmable cache. If you rely on its advanced features, eg cache tags, cache on cookie value, or custom cache keys, then you have to be on Cloudflare's top plan, which is not cheap.
The other major drawback of Cloudflare is that, unless you are on the most expensive plans, you have to cede control of DNS to them by delegating your domain. Cloudflare DNS is a great service; however, it caches negative DNS responses for an hour. If you were switching from an A record to a CNAME record or vice versa, you could be down for an hour regardless. Not ideal.
Akamai
Akamai has a highly respected, fully featured, and very expensive product. Maintaining a backup option with them will run into thousands of dollars a month. Only you can decide whether it’s worth it.
CloudFront
Amazon's CDN offering is the third of the big three alternatives. Since it uses volume based billing, it could be an attractive CDN option as a standby, as long as you don't mind missing out on cache by tag (sorry Magento and Drupal). It is also complicated to configure for dynamic content and may lack features that you need. In fact many people use CloudFront only for static content, eg images, CSS, etc, and run a Varnish instance within AWS to provide full page caching that is easier to configure.
This is what the BBC did during the Fastly outage. They had their backup infrastructure on CloudFront and, as of the time of writing, hadn't switched back to Fastly.
Peakhour.io
Peakhour also uses volume based billing, with a minimum monthly charge of $20. We provide all the advanced caching features that Fastly does, as well as WAF and image optimisation as standard, all in the one service fee. We don't require you to cede control of DNS to us, and we're Australian owned and based.
Final Thoughts
CDNs, no matter how big, can fail. If your website is important then it needs a Plan B. This is how that Plan B works, and it doesn't have to be expensive when using a provider like Peakhour.io.
The important part is having it configured and tested before you need it.