I like reading downtime reports, because they show what can go wrong and how people and departments respond to a crisis. Two sites experienced downtime over the weekend – one very well known and one not.
WordPress.com went down over the weekend, disrupting thousands of blogs, including those of VIP subscribers. According to the report, an unscheduled change was made to a router at the data hosting company, which left wordpress.com able to respond to only a fraction of incoming requests. In other words, wordpress.com was not down, just inaccessible to 90% of incoming traffic. The failover mechanism was not activated, presumably because the host itself was not down: the server was running fine, but its ability to serve web pages was hampered.
This suggests the following improvement areas (speaking overall):
- Use some sort of change control – and test changes when they are made. This unscheduled change very likely affected not just wordpress.com, but many other sites as well.
- Monitor not just the server, but paths into the server – everything between the customer and the server.
- Failover mechanisms should be sensitive not just to server performance, but to anything that affects the delivery of web pages to the public (or whatever service is being offered).
- Relying on a single hosting provider at any one time means that any problem at that provider affects your service in its entirety; spreading a cluster across multiple providers means that if one provider drops out, your service continues (though slightly degraded).
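The monitoring and failover points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how WordPress.com or its host actually does it: the names `ProbeResult`, `fetch_ok`, `success_rate` and `should_fail_over`, and the 90% threshold, are all hypothetical. The idea is to judge the service by whether pages actually reach users probing from several outside locations, rather than by whether the server process is alive.

```python
from dataclasses import dataclass
from typing import Iterable
import urllib.request


@dataclass
class ProbeResult:
    vantage_point: str  # hypothetical label for where the probe ran from
    ok: bool            # True if a page was actually served to that probe


def fetch_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTPError and socket timeouts.
        return False


def success_rate(results: Iterable[ProbeResult]) -> float:
    """Fraction of probes that got a page; 0.0 when there are no results."""
    results = list(results)
    if not results:
        return 0.0
    return sum(r.ok for r in results) / len(results)


def should_fail_over(results: Iterable[ProbeResult],
                     min_success_rate: float = 0.9) -> bool:
    """Fail over when too few probes can reach the service.

    The 0.9 threshold is an assumption chosen for illustration.
    """
    return success_rate(results) < min_success_rate
```

In a situation like the one described, a server-side check would report everything healthy, while ten external probes with only one success would give a success rate of 0.1 and `should_fail_over` would return `True`. Note that `fetch_ok` only tests the path from the machine it runs on; real coverage of "everything between the customer and the server" needs probes running from several networks.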
The other site that went down was jdorganizer.com (the web site for Jeri Dansky: Professional Organizer). She was a system administrator before becoming a professional organizer, so she knows IT. As a user, she had to respond to the outage she experienced (again caused by the data hosting provider).
Jeri explains on her blog what happened, and how she responded as a user of services. She lists the things she learned from the experience, in particular preparing a disaster plan and reviewing it.
She also switched providers once she no longer trusted hers to provide reliable service; being of a technical bent, she was able to make the switch and configure things reasonably easily. She had someone check availability and fixed the problems that arose.
Both of these experiences provide a window into how companies and other users of hosting services can respond when things fail. In both cases the provider failed, and the way its customers responded can help us learn what to do if and when it happens to us.
Kudos to the WordPress.com team for keeping the blogs running, and kudos to both for being willing to tell us what happened (in delightfully complete technical detail…).