How often do you see something like this on the web?
I encounter this almost every week on Twitter’s web site, or some Twitter client I use cites API problems. I don’t mean to pick on them; there are a lot of examples: Tumblr’s been down. Amazon has an outage. Comcast, too. WikiLeaks sympathizers have been harassing certain sites with DDoS attacks. Sunday-night scheduled maintenance on web sites and cellular systems makes it a good time to sit down and read a physical book.
I’ve spent a career running broadcast and distance-learning telecom systems that are designed for many more “nines” of reliability than I perceive some of our major web operations to be achieving (99.9% = 526 minutes of outage per year; 99.99% = 53 minutes per year). I’d be surprised if, in the aggregate, Twitter is even getting three nines.
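For reference, here’s a quick back-of-the-envelope sketch of how those availability figures translate into yearly downtime budgets (plain arithmetic, not tied to any particular site or vendor):

    # Downtime budget per year for common "nines" of availability.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

    for label, availability in [("three nines", 0.999),
                                ("four nines", 0.9999),
                                ("five nines", 0.99999)]:
        downtime = MINUTES_PER_YEAR * (1 - availability)
        print(f"{label} ({availability:.3%}): ~{downtime:.0f} minutes of downtime per year")

That works out to roughly 526, 53 and 5 minutes per year, respectively.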
If we’re serious about making web and mobile media competitive with broadcast media, then someone needs to figure out how to improve the reliability we’re getting today.
Update 14 December 2010:
I'm "promoting" Stephen Hill's comments below on how difficult this level of reliability is to achieve. The point of my post was to suggest that we have a long way to go before we can put live radio on the web with the scale and reliability to which we've become accustomed in broadcasting -- and Stephen makes that point even better. --Dennis
The rhetoric of this post betrays some of the differences between traditional broadcast infrastructure and the new era of web-based infrastructure.
Web services are built on hardware and network infrastructure, but run entirely on software. Even highly standardized software like the Apache web server has hundreds of variables in setup and operation, which increase entropy and decrease reliability. When you add custom software to create any kind of practical web service, the variables (and therefore the possible bugs) multiply exponentially.
I'm not a CTO or even close, but as a small web music service provider, we have been forced to grapple with the Inescapable Truths of Online Reliability, which go something like this:
1. Reliability is inversely proportional to complexity in a hardware/software system.
1A. The larger the number of users and/or the more functionally sophisticated the site, the more complex the hardware/software system must be, and therefore the less reliable it will be.
2. Reliability can be bought at a premium by adding additional servers, load balancing, "hot spares" and redundant functionality (a rough sketch of the arithmetic follows this list). However:
2A. Each increase in real or virtual (cloud) hardware and software makes the overall system more challenging to manage. More servers are also a more attractive target for attack and increase security exposure unless the right preemptive steps are taken to defend them.
3. You can buy "five nines" of uptime (about 5 minutes of downtime per year) for a big premium, but it can never be 100% guaranteed. Each additional nine of reliability will be roughly 5 to 10x more costly. Besides, all a guaranteed Service Level Agreement really gets you is a better attitude from the vendor and a credit when things inevitably fuck up.
4. The growth curve of the most highly visible and successful Internet sites (like Twitter) makes the problem of scaling infrastructure under load 100x more difficult.
5. It makes more sense to plan for minimizing recovery time after an outage than to try to prevent outages completely.
6. Except for the goal of 100% reliability, broadcast infrastructure management practices are largely irrelevant. The valid comparison would be to an entire broadcast network, not a single station. Broadcast infrastructure remains at a relatively fixed size once operational, regardless of the number of listeners/viewers, and can be optimized over time. Digital network infrastructure has to "scale" and change constantly to support millions of users, and is much more difficult to optimize and manage.
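To put some rough numbers behind point 2, here is a sketch of the usual redundancy arithmetic -- assuming independent failures, a purely hypothetical 99% availability per server, and failover that actually works, N replicas in parallel give an overall availability of 1 - (1 - a)^N:

    # Combined availability of N redundant servers, assuming independent failures
    # and that any one surviving server can carry the load.
    # The 99% per-server figure is purely hypothetical.
    def combined_availability(per_server: float, replicas: int) -> float:
        return 1 - (1 - per_server) ** replicas

    for replicas in (1, 2, 3):
        overall = combined_availability(0.99, replicas)
        print(f"{replicas} server(s) at 99% each -> {overall:.4%} combined")

That prints roughly 99%, 99.99% and 99.9999%. Point 2A is exactly why the real world falls short of this math: failures are rarely independent, and every added server brings its own configuration, monitoring and security burden.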
Considering the above, it is quite a remarkable achievement that some sites, like Google, Amazon, Flickr, Yahoo and Facebook, are as reliable day to day as they are. Twitter is a particularly troubled example of a site that has had difficulty keeping up with its growth.
BOTTOM LINE ON WEB RELIABILITY: Easy to say -- very, very hard to do.