A system failure does not always have a single, identifiable (and preventable) cause. In fact, often it does not.
The BP oil spill provides an excellent case study of this situation: the spill had numerous contributing causes, most of which seem minor or unlikely in isolation, but which taken together resulted in catastrophe.
What general principles can we apply to prevent such cascade failures? One should already be familiar: test, test, and test fail-over systems before they are needed. Without testing, there is no guarantee that things will work as expected. In the BP oil spill, several failsafe systems did not engage as expected. In the recent IT disaster in Virginia, a storage system did not fail over when needed, resulting in several days of downtime for a significant portion of state resources.
Another principle is: Sift and winnow alarms to the truly necessary. Pay attention to alarms and other indicators, and purge alarms and notifications that are irrelevant or unnecessary. In the oil spill, one of the causes was that the indications of an immediate problem were masked by other occurrences at the time. Too many alarms and too much information can hide problems and lead administrators to ignore problems that require intervention.
To improve alarms and prevent unnecessary ones, use time-based alarms, i.e., alarms that account for change over time. Products such as SGI’s Performance Co-Pilot (PCP), HP’s Performance Agent (included with some versions of HP-UX), and the Simple Event Correlator (SEC) all help. With alarms that work over a time span, brief momentary peaks will not trigger an alarm, but chronic problems will.
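To make the idea of a time-based alarm concrete, here is a minimal Python sketch. It is not how PCP, Performance Agent, or SEC implement it; the class name SustainedAlarm, the function alarm_history, and the threshold and window values are all hypothetical, chosen only for illustration. The alarm fires only when every reading in the most recent window exceeds the threshold, so a single spike is ignored.

```python
from collections import deque


class SustainedAlarm:
    """Hypothetical alarm: fires only when the last `window` readings
    all exceed `threshold`."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        # A bounded deque drops the oldest reading automatically.
        self.samples = deque(maxlen=window)

    def observe(self, value):
        """Record one reading; return True if the alarm should fire."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.threshold for v in self.samples)


def alarm_history(readings, threshold, window):
    """Feed a sequence of readings through a fresh alarm and return
    the fire/no-fire decision made after each reading."""
    alarm = SustainedAlarm(threshold, window)
    return [alarm.observe(r) for r in readings]
```

With a threshold of 90 and a window of 3, a lone reading of 95 surrounded by low values never fires, while three consecutive readings above 90 do. A real deployment would likely use timestamps rather than a fixed sample count, but the principle is the same.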
Another item that can improve alarm response is systemic alarms: that is, monitoring that accounts for multiple systems or processes and combines their states into a single, meaningful indication. Is the web site running smoothly? Are all systems reporting logs to the central server? Are all virtual environments running?
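A systemic alarm can be sketched as a roll-up of several independent checks into one answer. The Python below is only an illustration under stated assumptions: the function systemic_status and the check names in the usage note are hypothetical, and each check is assumed to be a callable that returns a true value when its subsystem is healthy.

```python
def systemic_status(checks):
    """Run each named check and roll the results up into one status.

    `checks` maps a description to a zero-argument callable that
    returns a true value when healthy. Returns a tuple of
    (overall_ok, names_of_failing_checks).
    """
    failing = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that cannot run at all is treated as a failure,
            # not silently skipped.
            ok = False
        if not ok:
            failing.append(name)
    return (not failing, failing)
```

For example, passing checks named "web site responding", "all hosts logging centrally", and "all virtual environments running" yields a single overall status plus the list of whichever checks failed, so one alert can say what is wrong with the service as a whole rather than raising three separate alarms.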
One of the earliest problems that led to the oil spill was a lack of oversight of the contractors involved in building the well. Each assumed that the others would do the right thing. In system administration, we likewise assume that other systems are functioning correctly. To prevent failure, we should assume that other systems and processes could fail, and verify for ourselves that the systems we are responsible for will keep working even then. What systems are you dependent on? Power? Cooling? Serial concentrator? Operations staff? Backup staff?
Part of the problem in the Virginia failure was that the state did not oversee the external vendor well. With proper oversight (and demands), the state IT staff could have forced the storage vendor to test the fail-over processes, and could have implemented a backup plan in case the fail-over did not take place as expected.
By studying others’ failure patterns and experiences, we can help to minimize our own.