Recently, Amazon has made the news for several unrelated outages at different data centers. The causes of these outages are very interesting, and provide a lesson for the rest of us.

The most recent outage affected Amazon EC2 users on 13 May. It happened when the cutover that should have followed a loss of utility power failed: a switch did not activate as it should have. To make matters worse, the switch failed because of a misconfiguration made by the manufacturer.

This outage has been compared to a similar one that affected Rackspace in 2007. During a utility power outage caused by a car crash, the power went on and off a couple of times, preventing the cooling apparatus from cooling the data center properly. With heat rising in the data center, Rackspace had to shut down equipment or risk equipment failure.

Power losses also hit an Amazon data center twice on 4 May. The day was to involve a switchover from one of the electric utility's power substations to another. During this process, one UPS failed to cut over to the backup generators, causing an outage for a number of servers. Later that day, after the failed UPS had been bypassed, human error caused one of the backup generators to shut down, taking down servers once again.

One of the biggest obstacles to fixing these problems is money. Until something like this happens to them, many companies do not want to spend what it would take to avoid such outages.

What would it take to avoid outages like these? Assume that power outages will occur and cooling will fail; how do we handle it? How do we increase reliability?

  • Don’t rely on a single UPS to save the data center. Use multiple UPSes or put each rack on its own rack-mounted UPS.
  • Cluster multiple servers together across data centers in an active/active configuration so that downtime in one data center will be mitigated by another.
  • Pair administrators or team members together to reduce human error. Studies show that we are the worst evaluators of our own capabilities (“We have met the enemy and he is us”).
  • Make sure that cooling devices can run on backup power. Also verify that cooling devices will run long enough to shut down servers if necessary.
  • Investigate alternative cooling sources, such as external air, or “rack-local” cooling that could run off of UPS power.
  • Don’t trust the manufacturer. Nine times out of ten things might be just fine; that leaves a 10% failure rate. Shoot for 100%!
  • Remove all single points of failure. What if a UPS fails? What if an air conditioning unit fails? What if the power fails?
  • Test before things happen! Do you have complete protection from power failures? Test it. Do you have complete protection against cooling loss? Test it.
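
To put rough numbers on the "nine times out of ten" and single-point-of-failure items above, here is a minimal sketch in Python. It assumes that components fail independently of one another, which real power and cooling equipment often does not, and the 90% figure is only the illustrative number from the list, not a measurement. The function names are made up for this example.

```python
from math import prod
from typing import Sequence


def parallel_availability(unit_availability: float, units: int) -> float:
    """Chance that at least one of `units` redundant units works.

    Assumes the units fail independently (an idealization).
    """
    return 1 - (1 - unit_availability) ** units


def series_availability(availabilities: Sequence[float]) -> float:
    """Chance that every link in a chain works.

    Each link here is a single point of failure: if any one fails,
    the whole chain fails.
    """
    return prod(availabilities)


if __name__ == "__main__":
    # Redundant units, each working "nine times out of ten".
    for n in (1, 2, 3):
        print(f"{n} unit(s) at 90%: {parallel_availability(0.9, n):.1%}")

    # Three non-redundant links in a row, each at 90%.
    print(f"three 90% links in a row: {series_availability([0.9, 0.9, 0.9]):.1%}")
```

Under those assumptions, a second redundant unit turns one failure in ten into one in a hundred, while a chain of three non-redundant 90% links works only about 73% of the time. Redundancy only helps if the redundant path actually works when called upon, which is why the last item in the list matters.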

Even smaller companies with a single on-site data center could take advantage of some of these options without breaking the bank. A company could cluster its servers with an external hosting site, or partner with a related company that would act as its external hosting site.

To save on labor costs, a company could spread the work over a longer schedule instead of adding more staff to the project.

One other thing could be done for reliability’s sake, though it is probably cost prohibitive for all but the most voracious and deep-pocketed corporations: add a second power feed from a different substation over different links. If power goes out at one substation, the other substation will continue to provide power. (Not sure that is even possible…)

Note that monitoring does not take the place of testing. There is a difference between knowing that the data center is without power and preventing the data center from losing power.

System checks do not take the place of testing either. The UPS might claim to be working fine; do you trust it? Perhaps it will work just fine, and perhaps not. Perhaps it passes its own checks but is misconfigured.
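
The gap between a status check and a real test can be shown with a toy simulation. This is only a sketch: `SimulatedUPS`, its fields, and `run_failover_drill` are hypothetical names invented for this example, not any vendor's API, and the "misconfigured transfer switch" is a stand-in for the kind of factory misconfiguration described earlier.

```python
from dataclasses import dataclass


@dataclass
class SimulatedUPS:
    """Hypothetical stand-in for a UPS; not a real vendor interface."""
    battery_charged: bool = True
    transfer_switch_enabled: bool = False  # simulated misconfiguration

    def self_check(self) -> bool:
        # A status check that only looks at the battery.
        return self.battery_charged

    def carries_load_when_utility_drops(self) -> bool:
        # The question a drill answers: does power actually reach the racks?
        return self.battery_charged and self.transfer_switch_enabled


def run_failover_drill(ups: SimulatedUPS) -> str:
    """Compare what the unit reports with what happens in a simulated outage."""
    if not ups.self_check():
        return "self-check failed"
    if not ups.carries_load_when_utility_drops():
        return "self-check passed, but failover FAILED"
    return "failover OK"


if __name__ == "__main__":
    print(run_failover_drill(SimulatedUPS()))
    print(run_failover_drill(SimulatedUPS(transfer_switch_enabled=True)))
```

In this toy model the misconfigured unit passes its own check and still drops the load; only the drill exposes the problem.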

Test! One of my favorite stories about testing has to do with a technical school that provided mainframe computing services for students and faculty. They tested a complete power cycle of the mainframe each month. One month, they went through their testing, rectifying whatever needed rectifying as they went. Two days later, a power loss to the entire school during a thunderstorm required an active restart of the mainframe – without incident.

Would that have been as smooth without testing? I doubt it.