Recently, the 365 Main colocation facility in San Francisco went down due to local power outages and related failures in its backup systems. The best detailed technical description was at Scobleizer (read the comments), and the best technical overview of the entire situation is at CNET News.com. Other reports can be found at:

DataCenter Knowledge

All of this has me thinking about single points of failure. To handle every disaster, there must be complete redundancy throughout the system, and there is no way to keep costs down while building a completely redundant setup. According to the description at Scobleizer, the facility had a complete and sophisticated power backup system, but the repeated losses of power caused the backup system to stop providing power.

In the case of 365 Main, it sounds as if their backup power was not redundant, though one could spend a lot of money trying to be completely and utterly redundant. Let's consider all of the parts that would need to be redundant if complete redundancy were required:

  • System software
  • Applications
  • System hardware, including power (power supplies)
  • Data center power
  • Data center air conditioning
  • Backup power
  • Data center location
  • Backup tapes
  • Backup system
  • Notification system
  • Notification “chain-of-command”

And certainly there are more. The possibility of failure will never be zero. It is almost foolish to guarantee uptime 24 hours a day, 365 days a year. Failures happen, and they will affect us. Let’s consider some of the ways that 365 Main and its customers might have avoided their troubles.
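Before getting to those, here is a rough sketch of why the possibility of failure never reaches zero. The numbers are made up for illustration: it assumes eleven layers (one per item in the list above), each available 99.9% of the time, and it assumes failures are independent, which the 365 Main incident suggests is optimistic. The function names are my own, not from any library.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def series_availability(parts):
    """Availability of a chain in which every part must be up at once."""
    return prod(parts)


def parallel_availability(single, copies):
    """Availability of `copies` independent replicas when any one being up is enough."""
    return 1 - (1 - single) ** copies


# Hypothetical figures: eleven layers, each up 99.9% of the time.
layers = [0.999] * 11
chain = series_availability(layers)
downtime_hours = (1 - chain) * HOURS_PER_YEAR

print(f"chain availability: {chain:.4f}")
print(f"expected downtime:  {downtime_hours:.1f} hours per year")

# Duplicating the whole chain helps, but only if the copies fail independently.
print(f"two independent chains: {parallel_availability(chain, 2):.6f}")
```

Even with every layer at "three nines", the chain as a whole comes out worse than any single layer, and adding copies only shrinks the gap; it never closes it.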

First, 365 Main did not have differing backup power systems, from what has been said. The method of generating backup power was the same throughout. A completely different backup power source, using different methods, would in all likelihood have prevented the power loss. No doubt it would have been prohibitively expensive, however…

Alternatively, the customers of 365 Main could have had secondary hosts elsewhere in a geographically dispersed cluster. If one host goes down (in San Francisco, say), then the hosts in Denver, New York City, and Chicago could keep going.
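As a minimal sketch of that idea, the snippet below walks an ordered list of mirrors and picks the first one that answers a health check. Everything named here is hypothetical: the `example.com` URLs, the `MIRRORS` list, and the `first_healthy` and `http_probe` helpers are placeholders for illustration, not anything 365 Main's customers actually ran. Real deployments usually do this at the DNS or load-balancer level rather than in a script.

```python
import urllib.error
import urllib.request

# Hypothetical health-check URLs, one per site, in order of preference.
MIRRORS = [
    "https://sf.example.com/health",
    "https://denver.example.com/health",
    "https://nyc.example.com/health",
    "https://chicago.example.com/health",
]


def http_probe(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, HTTP error: treat as down.
        return False


def first_healthy(mirrors, probe=http_probe):
    """Return the first mirror the probe reports as up, or None if all are down."""
    for url in mirrors:
        if probe(url):
            return url
    return None


# Simulate the San Francisco site being down, without touching the network.
down = {"https://sf.example.com/health"}
chosen = first_healthy(MIRRORS, probe=lambda url: url not in down)
print(chosen)
```

Passing the probe in as a parameter keeps the selection logic testable without a network; swapping in `http_probe` would make it check the real sites.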

Having redundancy throughout can be expensive, but not having redundancy can also be expensive, as some major web sites found out recently.
