The recent Wikipedia outage shows the problems with a typical failover system. The European data center hosting Wikipedia’s servers experienced a cooling failure, which caused the servers to shut down. This kind of failure is common enough (though it should be prevented).
The event was recorded in the admin logs starting at 17:18 on March 24. Wikimedia documents all of its server administration at wikitech.wikimedia.org.
What happened next made the outage longer than it should have been: the failover from the European data center to Wikipedia’s Florida data center failed to complete properly.
Certainly, to prevent this failure, the failover (and fail-back) could have been tested further, the process refined, and the tests done routinely.
However, there is another possibility: keep the redundant servers active instead of on standby. That is, instead of having a failover process kick in when a failure occurs, use an active environment in which all of the redundant servers are already serving clients.
If you have a failover process, it is a sort of “opt-in” – the standby servers must choose to take over from the failed servers. Thus there is a process (the failover) that must be tested, and tested often, to make sure that it will work when it is needed. In many cases, testing also means incurring an actual service outage. This is the active-passive high availability cluster model: the passive server must be brought online and take over from the failed nodes.
Using an active but redundant environment means that if any server – or data center – dies, service is degraded slightly and nothing more. This is the active-active high availability cluster model. There is no need for monthly failover testing – and perhaps no separate testing at all: during upgrades, the servers can be taken out of service one at a time and the results monitored.
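The difference between the two models can be sketched in a few lines of Python. This is only an illustration, not a description of Wikimedia's setup: the class names, the `healthy` flag, the `promote_standby` step, and the site names "eu" and "us" are all hypothetical. The point it tries to show is that the active-passive pair depends on an extra step completing, while the active-active pool simply skips whatever is down.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Site:
    """A server or data center (hypothetical model)."""
    name: str
    healthy: bool = True


@dataclass
class ActivePassivePair:
    """Only the primary serves; the standby serves after a failover step."""
    primary: Site
    standby: Site
    standby_promoted: bool = False

    def promote_standby(self) -> None:
        # Stand-in for the failover process, the step that can go wrong.
        self.standby_promoted = True

    def route(self) -> Site:
        if self.primary.healthy:
            return self.primary
        if self.standby_promoted and self.standby.healthy:
            return self.standby
        raise RuntimeError("primary down and failover has not completed")


@dataclass
class ActiveActivePool:
    """Every healthy site serves; a dead site is simply skipped."""
    sites: list
    _counter: count = field(default_factory=count)

    def route(self) -> Site:
        live = [s for s in self.sites if s.healthy]
        if not live:
            raise RuntimeError("no healthy sites")
        # Round-robin across whichever sites are currently healthy.
        return live[next(self._counter) % len(live)]
```

With the pool, marking one site unhealthy leaves `route()` working with no further action. With the pair, `route()` raises until `promote_standby()` has run, which is the window in which an outage like the one described above is extended.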
The usual argument against such redundancy is cost: the redundant servers need spare capacity to take on the load of a failed peer, and that capacity is unavailable for other uses in normal operation. Yet how much downtime can you experience before you start losing money or public goodwill?
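The cost argument can be made concrete with some simple arithmetic. The sketch below assumes identical sites that share load evenly, which is a simplification; the function names and the example figures are made up for illustration.

```python
def max_safe_utilization(sites: int, tolerated_failures: int = 1) -> float:
    """Highest fraction of total capacity usable in normal operation
    while still surviving the loss of `tolerated_failures` sites."""
    if sites <= tolerated_failures:
        raise ValueError("need more sites than tolerated failures")
    return (sites - tolerated_failures) / sites


def required_capacity_per_site(
    peak_load: float, sites: int, tolerated_failures: int = 1
) -> float:
    """Capacity each site needs so the survivors can carry the peak load."""
    if sites <= tolerated_failures:
        raise ValueError("need more sites than tolerated failures")
    return peak_load / (sites - tolerated_failures)


# Two active sites: each may run at no more than 50% in normal operation.
print(max_safe_utilization(2))            # 0.5
# Four active sites: 75% is usable, so only a quarter is held in reserve.
print(max_safe_utilization(4))            # 0.75
# A hypothetical peak of 900 requests/s across four sites, one may fail.
print(required_capacity_per_site(900, 4))  # 300.0
```

Under these assumptions the reserved share shrinks as the number of active sites grows, so the cost of redundancy is highest for a two-site setup and falls from there.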
If Wikipedia had set up its servers so that a failover was not necessary, it might have avoided going down for several hours.