Amazon's Catastrophic Cloud Outage

Published on 4/25/2011 by

Web hosting lessons we can’t ignore

Last Thursday was a bad day for Amazon.com. At approximately 5 a.m. on April 21st, Amazon’s Northern Virginia data center experienced what was called a “network event.” That “event” created a domino effect that rippled across the Internet, taking major sites like Reddit, FourSquare, HootSuite and Quora offline. Even today, four full days after the event, Amazon’s Service Health Dashboard is still reporting connectivity and latency issues in the Northern Virginia data center. What was once thought of as one of the most robust clouds in the hosting industry is now being looked at through skeptical eyes.

What does this say about the business of hosting?

Ignore buzzwords, pay attention to facts

The word “cloud” has become a bit of a buzzword in the last two years. We have cloud hosting, cloud storage and cloud backups, and even Google is developing a cloud computing system and a cloud-based computer. But what does it all mean?

When it comes to hosting, cloud-based web hosting translates quite simply into hosting space that is spread across many servers in a cloud infrastructure rather than tied to a single machine. Amazon happens to use something called Elastic Compute Cloud (EC2), a redundant failover setup that is supposed to scale up to handle giant spikes in traffic, preventing your website from going down.

The fact is that nothing is completely failproof. Google, Yahoo, your telephone and cable company and yes, even Amazon, occasionally have to deal with outages. They’re rare, but they do happen. Cloud hosting is a new business model in the world of hosting, but that doesn’t make it perfect or 100% reliable in every situation.

Have a contingency plan

While Reddit, FourSquare, HootSuite and even portions of the New York Times were offline, SmugMug, a highly regarded online photo editing and sharing website that also utilizes Amazon’s cloud services, stayed up. Why? SmugMug’s Don MacAskill said this: “Despite using a lot of Amazon services, SmugMug didn’t go down because we spread across availability zones and designed for failure to begin with.” Designed for failure? Yes! Planning ahead for the worst-case scenario and knowing the limitations of your hosting setup can keep you online when everyone else goes down.
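To make “designed for failure” a little more concrete, here is a minimal Python sketch of the general idea: keep a list of locations that can serve your site and fall back to the next one when a health check fails. This is only an illustration and not SmugMug’s actual setup; the zone names and the `fake_health_check` function are hypothetical placeholders, and a real check would probe a live server.

```python
from typing import Callable, Iterable, Optional


def first_healthy(zones: Iterable[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first zone whose health check passes, or None if all fail."""
    for zone in zones:
        try:
            if is_healthy(zone):
                return zone
        except Exception:
            # A health check that raises is treated the same as a failed zone.
            continue
    return None


# Hypothetical zone names, used only for illustration.
ZONES = ["zone-a", "zone-b", "zone-c"]


def fake_health_check(zone: str) -> bool:
    """Stand-in for a real probe: pretend that zone-a is having an outage."""
    return zone != "zone-a"


print(first_healthy(ZONES, fake_health_check))  # prints zone-b
```

The point of the sketch is that the fallback list exists before anything breaks. If every copy of your site lives in one zone, there is nothing to fall back to.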

Understand Service Level Agreements

Service Level Agreements (SLAs) spell out your contractual arrangement with your hosting company. Although some services were down for nearly four days, this beast of an outage did not violate Amazon’s SLA, meaning that customers will not receive any sort of compensation for the unexpected offline time. This might seem unfair, but Amazon’s SLA clearly states that Amazon “guarantees 99.95% availability of the service within a Region over a trailing 365[-day] period.” Because that guarantee covers EC2, and it was the EBS and RDS services that failed rather than EC2 itself, Amazon has not violated its SLA.
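It helps to turn an availability percentage into actual hours. The short Python sketch below does that arithmetic for the 99.95% figure quoted above. The function name is made up for this example, and the calculation assumes the percentage applies evenly across a 365-day period; how Amazon actually measures availability is defined by the SLA itself.

```python
def allowed_downtime_hours(availability_pct: float, period_days: int = 365) -> float:
    """Hours of downtime permitted by an availability percentage over a period."""
    total_hours = period_days * 24
    # The unavailable fraction is whatever the guarantee leaves over.
    return total_hours * (1 - availability_pct / 100)


budget = allowed_downtime_hours(99.95)
print(f"{budget:.2f} hours per year")  # prints 4.38 hours per year
```

In other words, 99.95% over a year leaves room for roughly four and a half hours of downtime. An outage of nearly four days is about 96 hours, which is why it matters so much which services an SLA actually covers.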

Amazon’s outage has shown us that all web hosting is susceptible to network problems. Just as road traffic will eventually back up or slow down no matter how good a driver you are, hosting can be hit by network problems no matter how good your hosting company’s reputation is. By knowing the limitations of your particular type of hosting and having a solid backup plan in place so that you’re never caught off guard, you give your website a much better chance of staying online when everyone else around you is going down.