29 August 2010
The Virginia Information Technologies Agency (VITA) has been dealing with an outage that began while storage technicians were scanning for failed components. The outage affects “only” 228 of 3,600 servers, but those servers support 24 different Virginia departments, including the Virginia Dept. of Motor Vehicles (DMV), the Governor’s Office, the Dept. of Taxation, the Dept. of Alcoholic Beverage Control, and even VITA itself.
No word on how many Virginians are affected, but certainly it will be in the thousands.
According to the Associated Press, the crash happened when a memory card failed and the fall-back hardware failed to operate successfully. From the sound of it, an entire storage array was affected by this – how else to account for the 228 servers affected?
This suggests that the storage array is a single point of failure, and that neither the memory card nor its fall-back was tested. There should be some way of testing the hardware, or a clustered storage backup should be in place.
One of the biggest problems in many IT environments – including state governments – is budget: having full redundancy for all subsystems is expensive. States are not known for budgets that fill all departmental needs (despite large budgets, departments scrounge most of the time…). Many other data center owners also work with consistently tight budgets: libraries, non-profits, trade associations, etc.
The usual response to tight budgets is to consider the likelihood of failure in a particular component, and to skip redundancy for components judged unlikely to fail. A better way to look at it would be to compare not the likelihood of failure, but the cost of failure. The failure in Virginia’s storage was supposed to be extremely unlikely, but has had a tremendous cost both to Virginia and to Northrop Grumman.
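To make the comparison concrete, here is a minimal sketch in Python. It uses expected annual loss (probability of failure times cost of failure), which is one common way to combine the two factors and not something taken from the reports on the outage. Every component name, probability, and dollar figure below is a made-up assumption for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    annual_failure_probability: float  # hypothetical estimate, 0.0 to 1.0
    cost_of_failure: float             # hypothetical cost in dollars

    @property
    def expected_annual_loss(self) -> float:
        # Chance of a failure in a given year times what that failure costs.
        return self.annual_failure_probability * self.cost_of_failure


# Illustrative numbers only; none of these come from the Virginia outage.
components = [
    Component("web server", 0.20, 10_000),
    Component("storage array", 0.01, 5_000_000),
]

# Ranking by likelihood alone puts the frequently failing web server first.
by_likelihood = sorted(
    components, key=lambda c: c.annual_failure_probability, reverse=True
)

# Ranking by expected loss puts the rarely failing storage array first.
by_expected_loss = sorted(
    components, key=lambda c: c.expected_annual_loss, reverse=True
)

for c in by_expected_loss:
    print(f"{c.name}: expected annual loss ${c.expected_annual_loss:,.0f}")
```

With these assumed figures, the component that is twenty times less likely to fail is still the one where redundancy would be worth the most, because a single failure is so much more expensive.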
Virginia’s IT was outsourced in part to Northrop Grumman, and was supposed to be a national model for privatizing state IT. However, Virginia’s experience shows how privatizing can fail, and how outsourcing companies do not necessarily provide the service that is desired.
The Washington Post reported on this failure, as well as on earlier ones, such as when the network failed and there was no backup in place. Both Virginia and Northrop Grumman should have noticed the missing redundancy and fixed it before it was needed.
The Richmond Times-Dispatch has an article on this outage, and will be updating it over the weekend.