On 24 February 2010, Google App Engine suffered an outage when an entire data center lost power. App Engine was down for two hours as staff worked feverishly to fix problems after power was restored.
Google released a detailed downtime report, which has been called a near-perfect example of a good report. Data Center Knowledge summarized the event well in an article; the site has also spoken with Google previously about how the company handles outages.
Google also kept people apprised of what was happening throughout the outage.
Google’s handling of a data center outage stands in stark contrast to the handling of a 6 March 2010 outage at Datacom in Melbourne, Australia. The story is just incredible. The data center’s managing director said there was absolutely no outage; customers, the company’s network operations center (NOC), and the press all disagreed – and backed up their account with pictures.
Some people seemed upset that pictures were taken inside the data center and published on the Internet and in the press (cellphones now have to be dropped off at the door), yet this is what a whistleblower does. If this event had been handled differently, Datacom would no doubt have been better off.