How Systems Fail (and Principles of Prevention)

A system failure does not always have a single, identifiable (and preventable) cause. In fact, often it does not.

The BP oil spill provides an excellent case study of this situation: the spill had numerous contributing causes, most of which seemed minor or unlikely in isolation, but which taken together resulted in catastrophe.

What general principles can we apply to prevent such cascade failures? One should already be familiar: Test, test, test fail-over systems before they are needed. Without testing, there is no guarantee that things will work as expected. In the BP oil spill, several failsafe systems did not engage as expected. In the recent IT disaster in Virginia, a storage system did not fail over when needed, resulting in several days of downtime for a significant portion of state resources.
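As a small illustration of the principle (scaled far down from blowout preventers and storage arrays), here is a minimal Python sketch of a scheduled fail-over drill that simply verifies the standby equipment answers at all before it is needed; the hostnames and ports are hypothetical placeholders:

#!/usr/bin/env python3
"""Minimal sketch of a scheduled fail-over drill: verify the standby gear
answers before it is needed. Hostnames and ports are hypothetical."""

import socket
import sys

STANDBY_CHECKS = [
    ("standby-db.example.com", 5432),   # hypothetical standby database
    ("standby-web.example.com", 443),   # hypothetical standby web front-end
]

def reachable(host, port, timeout=5.0):
    """Return True if the standby answers a plain TCP connect."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def main():
    failures = [(h, p) for (h, p) in STANDBY_CHECKS if not reachable(h, p)]
    for host, port in failures:
        print(f"DRILL FAILED: standby {host}:{port} is not answering", file=sys.stderr)
    # A non-zero exit lets cron or a monitoring system raise an alarm long
    # before the standby is actually needed.
    sys.exit(1 if failures else 0)

if __name__ == "__main__":
    main()

A reachability check is of course not a full fail-over test; the point is only that drills can be automated and scheduled rather than left until a real failure forces the issue.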

Another principle is: Sift and winnow alarms to the truly necessary. Pay attention to alarms and other indicators, and purge alarms and notifications that are irrelevant or unnecessary. In the oil spill, one contributing cause was that indications of an immediate problem were masked by other events occurring at the same time. Too many alarms and too much information can hide problems and lead administrators to ignore problems that require intervention.

To improve alarms, and prevent unnecessary alarms, use time-based alarms, i.e., alarms that account for change over time. Products such as SGI’s Performance Co-Pilot (PCP), HP’s Performance Agent (included with some versions of HP-UX), and the Simple Event Correlator (SEC) can all help. With alarms that work over a time span, brief momentary peaks will not trigger alarms, but chronic problems will.
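As a rough illustration of the idea (not a configuration example for PCP, the Performance Agent, or SEC), here is a minimal Python sketch of a time-windowed alarm, assuming a Unix host where the one-minute load average is sampled once a minute; the thresholds are hypothetical:

#!/usr/bin/env python3
"""Minimal sketch of a time-based alarm: sample a metric on a schedule and
alarm only when it stays over threshold for most of the window."""

import os
from collections import deque

WINDOW = 15        # keep the last 15 one-minute samples
THRESHOLD = 8.0    # hypothetical alarm threshold for the 1-minute load average
SUSTAINED = 10     # samples over threshold required before alarming

samples = deque(maxlen=WINDOW)

def record_sample():
    """Take one sample; a real collector would run this from a scheduler."""
    load1, _, _ = os.getloadavg()
    samples.append(load1)

def should_alarm():
    """Alarm on chronic problems only: a brief peak never fills the window."""
    over = sum(1 for s in samples if s > THRESHOLD)
    return over >= SUSTAINED

A one-minute spike fills at most one slot in the window, so it never trips the alarm; a load that stays high for ten minutes does.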

Another item that can improve alarm response is systemic alarms: that is, monitoring that accounts for multiple systems or processes and combines them into a single, meaningful overall status. Is the web site running smoothly? Are all systems reporting logs to the central server? Are all virtual environments running?
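A minimal sketch of a systemic alarm in Python, rolling two of the checks above into one combined status; the URL, hostname, and port are hypothetical placeholders:

#!/usr/bin/env python3
"""Minimal sketch of a systemic alarm: several independent checks rolled up
into one overall status instead of a pile of separate alerts."""

import socket
import urllib.request

def web_site_ok(url="https://www.example.com/"):
    """Is the public web site answering at all?"""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False

def log_server_ok(host="loghost.example.com", port=514):
    """Is the central log server reachable from here?"""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False

CHECKS = {"web site": web_site_ok, "central logging": log_server_ok}

def overall_status():
    """One combined answer: OK, or DEGRADED plus the list of failed checks."""
    failed = [name for name, check in CHECKS.items() if not check()]
    return ("OK", []) if not failed else ("DEGRADED", failed)

The single rolled-up status is what gets paged on; the individual results remain available for diagnosis.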

One of the earliest problems that led to the oil spill was a lack of oversight of the contractors involved in building the well: each assumed that the others would do the right thing. In system administration, we likewise assume that other systems are functioning correctly. To prevent failure, we should assume that other systems and processes could fail, and verify for ourselves that the systems we are responsible for will not fail in any case. What systems are you dependent on? Power? Cooling? Serial concentrator? Operations staff? Backup staff?
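In that spirit, here is a minimal Python sketch that verifies two dependencies most systems silently assume are fine, DNS and NTP, rather than trusting them; the hostnames are hypothetical, and the NTP check only confirms that something answers, not that the time it serves is correct:

#!/usr/bin/env python3
"""Minimal sketch of trust-but-verify for upstream dependencies."""

import socket

def dns_ok(name="www.example.com"):
    """Can we actually resolve names through the resolvers we depend on?"""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False

def ntp_reachable(host="ntp.example.com", port=123):
    """Does the time source we depend on answer a minimal client request?"""
    packet = b"\x1b" + 47 * b"\0"   # minimal SNTP version 3 client packet
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(5)
            s.sendto(packet, (host, port))
            s.recvfrom(512)
            return True
    except OSError:
        return False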

Part of the problem in the Virginia failure was that the state did not oversee the external vendor well. With proper oversight (and demands), the state IT staff could have forced the storage vendor to test the fail-over processes, and could have implemented a backup plan in case the fail-over did not take place as expected.

By studying others’ failure patterns and experiences, we can help minimize our own.

Virginia Experiences a Severe IT Outage

The Virginia Information Technologies Agency (VITA) has been dealing with an outage that began while storage technicians were scanning for failed components. The outage “only” affects 228 of 3,600 servers, but those servers support 24 different Virginia departments, including the Virginia Dept. of Motor Vehicles (DMV), the Governor’s Office, the Dept. of Taxation, the Dept. of Alcoholic Beverage Control, and even VITA itself.

No word on how many Virginians are affected, but certainly it will be in the thousands.

According to the Associated Press, the crash happened when a memory card failed and the fall-back hardware did not take over successfully. From the sound of it, an entire storage array was involved – how else to account for the 228 affected servers?

This suggests that the storage array is a single point of failure, and that neither the memory card nor its fall-back was tested. There should be some way of testing the hardware, or a clustered storage backup in place.

One of the biggest problems in many IT environments – including state governments – is budget: full redundancy for every subsystem is expensive. States are not known for budgets that meet every departmental need (despite large overall budgets, departments scrounge most of the time…). Many other data center owners consistently have tight budgets as well: libraries, non-profits, trade associations, and so on.

The usual response to tight budgets is to weigh the likelihood of failure of each component and to skip redundancy where failure seems unlikely. A better approach is to compare not the likelihood of failure, but the cost of failure. The failure in Virginia’s storage was supposed to be extremely unlikely, yet it has had a tremendous cost to both Virginia and Northrop Grumman.
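To make the comparison concrete, here is a tiny back-of-the-envelope calculation in Python with entirely hypothetical numbers:

# Hypothetical figures, only to illustrate weighing cost rather than likelihood.
annual_failure_probability = 0.02     # an "unlikely" failure: 2% per year
cost_of_outage = 20_000_000           # days of downtime across two dozen agencies
cost_of_redundancy = 200_000          # redundant hardware plus regular fail-over tests

expected_annual_loss = annual_failure_probability * cost_of_outage   # 400,000 per year

# Judged by likelihood alone, a 2% risk looks safe to ignore; judged by cost,
# the expected loss is double the price of the redundancy, every year.
print(expected_annual_loss > cost_of_redundancy)   # True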

Virginia’s IT was outsourced in part to Northrop Grumman, and the arrangement was supposed to be a national model for privatizing state IT. However, Virginia’s experience shows how privatization can fail, and how outsourcing companies do not necessarily provide the service that is desired.

The Washington Post reported on this failure, as well as on earlier ones, such as a network failure for which no backup was in place. Both Virginia and Northrop Grumman should have noticed that gap and rectified it before the backup was needed.

The Richmond Times-Dispatch has an article on this outage and will be updating its coverage over the weekend.