Virginia Experiences a Severe IT Outage

The Virginia Information Technologies Agency (VITA) has been dealing with an outage that began while storage technicians were scanning for failed components. The outage “only” affects 228 of 3,600 servers, but those servers support 24 different Virginia departments, including the Virginia Dept. of Motor Vehicles (DMV), the Governor’s Office, the Dept. of Taxation, the Dept. of Alcoholic Beverage Control, and even VITA itself.

No word on how many Virginians are affected, but certainly it will be in the thousands.

According to the Associated Press, the crash happened when a memory card failed and the fall-back hardware failed to take over. From the sound of it, an entire storage array was affected by this – how else to account for the 228 servers affected?

This suggests that the storage array is a single point of failure, and that neither the memory card nor its fall-back was tested. There should be some way of testing the hardware, or a clustered storage backup should be in place.

One of the biggest problems in many IT environments – including states – is budget: having full redundancy for all subsystems is expensive. States are not known for budgets that fill all departmental needs (despite large budgets, departments scrounge most of the time…). Many other data center owners consistently have tight budgets: libraries, non-profits, trade associations, etc.

The usual response to tight budgets is to consider the likelihood of failure in a particular component, and to skip redundancy for components that are unlikely to fail. A better way to look at it would be to compare not the likelihood of failure, but the cost of failure. The failure in Virginia’s storage was supposed to be extremely unlikely, but it has had a tremendous cost both to Virginia and to Northrop Grumman.
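To make the difference concrete, here is a minimal Python sketch that ranks the same components both ways. The component names, probabilities, and dollar figures are invented for illustration; none of them come from the Virginia outage.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    failure_probability: float  # chance of failing in a year (made-up estimate)
    cost_of_failure: float      # dollars lost if it does fail (made-up estimate)


# Hypothetical inventory; none of these figures come from the Virginia outage.
COMPONENTS = [
    Component("departmental print server", 0.20, 5_000),
    Component("web front end", 0.10, 50_000),
    Component("shared storage array", 0.01, 5_000_000),
]


def rank_by_likelihood(components):
    """The usual approach: most failure-prone components first."""
    return sorted(components, key=lambda c: c.failure_probability, reverse=True)


def rank_by_cost(components):
    """The alternative: most expensive failures first."""
    return sorted(components, key=lambda c: c.cost_of_failure, reverse=True)


if __name__ == "__main__":
    print("By likelihood:", [c.name for c in rank_by_likelihood(COMPONENTS)])
    print("By cost:      ", [c.name for c in rank_by_cost(COMPONENTS)])
```

With these invented numbers, the storage array comes last when ranked by likelihood and first when ranked by cost, which is exactly the component a likelihood-only budget would leave without a spare.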

Virginia’s IT was outsourced in part to Northrop Grumman, and was supposed to be a national model for privatizing state IT. However, Virginia’s experience shows how privatizing can fail, and how outsourcing companies do not necessarily provide the service that is desired.

The Washington Post reported on this failure, as well as others in the past, such as when the network failed and there was no backup in place. Both Virginia and Northrop Grumman should have noticed the missing backup and rectified it before it was needed.

The Richmond Times-Dispatch has an article on this outage, and will be updating it over the weekend.

A Single Character Causes Downtime for… WordPress.com!

Last Thursday, an error in the wordpress.com software caused some user settings to be overwritten, which resulted in loss of settings for some customers. The site was taken down for checks, and an hour later, 99% of users were back online.

The cause of the error? A coding error of a single character. Certainly checks and balances are needed, but according to Matt Mullenweg, founder of WordPress.com, they are already using reviews and testing.

It was less than a month ago that Toni Schneider, CEO of Automattic, wrote in glowing terms about the use of “continuous deployment” at wordpress.com. Is this event going to lead to the death of “continuous deployment” at WordPress? I suspect not.

In fact, Paul Graham described in a paper how he used Lisp for Viaweb in just this fashion. Viaweb was releasing features this way before the practice had even become mainstream. Viaweb was bought by Yahoo! and became Yahoo! Store.

Let this WordPress.com downtime be a lesson in what a single character can do, and also a lesson that none of us is immune to such mistakes.

Arranging for downtime

Downtime is inevitable: servers need to be updated, patches applied, hardware upgraded or fixed, and so on. But how do you choose when to take the server down (if you have a choice)?

The first step is to ask the affected users of the system. This can range from getting the supervisor or person in charge to give an OK, to holding meetings with everyone affected. In some fashion, the users who are affected need to know and need to have input into the best time to take the server down. When the server is down, they may not be able to do much of their work; minimizing impact on them is important.

To minimize impact further, the downtime can often be arranged during off-hours. In the U.S., this is typically outside the normal workday of 8 a.m. to 5 p.m. (0800-1700). However, if the server is in use in other time zones, the span of hours when someone is working gets bigger, and if the server is used world-wide, it gets bigger still, leaving a smaller off-hours window.
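As a rough illustration of how quickly the quiet hours disappear, the Python sketch below works out which UTC hours fall outside an 8-to-5 workday for every office using a server. It is a simplification under stated assumptions: offices are described by whole-hour UTC offsets, daylight saving time is ignored, and the example offsets are hypothetical.

```python
WORKDAY_START = 8   # 8 a.m. local time
WORKDAY_END = 17    # 5 p.m. local time


def busy_utc_hours(utc_offset):
    """UTC hours during which an office at this offset is inside its workday."""
    # Local hour h corresponds to UTC hour (h - utc_offset) modulo 24.
    return {(hour - utc_offset) % 24 for hour in range(WORKDAY_START, WORKDAY_END)}


def quiet_utc_hours(utc_offsets):
    """UTC hours during which no office is inside its workday."""
    busy = set()
    for offset in utc_offsets:
        busy |= busy_utc_hours(offset)
    return sorted(set(range(24)) - busy)


if __name__ == "__main__":
    # Hypothetical offices: UTC-5 only, then UTC-5, UTC+0, and UTC+8 together.
    print(len(quiet_utc_hours([-5])), "quiet hours with one office")
    print(len(quiet_utc_hours([-5, 0, 8])), "quiet hours with three offices")
```

With a single office there are 15 quiet hours each day; with the three hypothetical offices above, only two remain.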

The other thing to think about is which day to take down the server. One can do it during the work week (Mon. through Fri.); however, all of those nights (except Friday) are constrained by the fact that the availability deadline would be something like the next day at 7:00 a.m. When downtime is scheduled for Friday night, there are two entire days more to get things right. If a downtime is necessary for a major upgrade or extensive changes, then scheduling for Friday night gives that much more time to get everything working before Monday morning.
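The difference is easy to put a number on. The Python sketch below counts the hours between the start of an evening outage and 7:00 a.m. on the next working day. It assumes a Monday-to-Friday work week, an outage that starts in the evening, and no holidays; the 6 p.m. start time and the dates are just examples.

```python
from datetime import datetime, time, timedelta


def hours_until_deadline(start, deadline_hour=7):
    """Hours from an evening start until deadline_hour on the next working day."""
    day = start.date() + timedelta(days=1)
    while day.weekday() >= 5:  # 5 is Saturday, 6 is Sunday: skip the weekend
        day += timedelta(days=1)
    deadline = datetime.combine(day, time(deadline_hour))
    return (deadline - start).total_seconds() / 3600


if __name__ == "__main__":
    tuesday_evening = datetime(2024, 1, 9, 18, 0)   # a Tuesday, 6 p.m.
    friday_evening = datetime(2024, 1, 12, 18, 0)   # a Friday, 6 p.m.
    print("Tuesday night:", hours_until_deadline(tuesday_evening), "hours")
    print("Friday night: ", hours_until_deadline(friday_evening), "hours")
```

A Tuesday night start leaves 13 hours; the same start on Friday leaves 61.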

Planning for a system outage: 8 ways to make it easy

When I worked for a cabinetmaker for a number of months, I asked him about the various tools available and about which ones he used. One thing he said a lot was “that brand is okay, but in a production environment it won’t last.” That is, the cabinetmaking shop would wear out the lightweight tools quickly; its needs were different from those of a home woodworker.

So it is in system administration. When you have users counting on a system to be up, even a planned system outage is going to be extremely unpleasant and costly. Add up the cost of every worker not being productive, projects delayed, overtime paid, and so on. If the planned outage then becomes even longer than expected, you can see the costs begin to add up.
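A back-of-the-envelope version of that sum might look like the Python sketch below. The function and its parameters are my own simplification, and the head count, rates, and hours in the example are invented.

```python
def outage_cost(idle_workers, hourly_rate, outage_hours,
                overtime_hours=0.0, overtime_rate=0.0, delay_cost=0.0):
    """Rough cost of an outage: lost productivity plus overtime plus project delays."""
    lost_productivity = idle_workers * hourly_rate * outage_hours
    overtime = overtime_hours * overtime_rate
    return lost_productivity + overtime + delay_cost


if __name__ == "__main__":
    # Hypothetical figures: 50 idle workers at $40 an hour.
    planned = outage_cost(50, 40.0, 4)
    overrun = outage_cost(50, 40.0, 6, overtime_hours=10, overtime_rate=60.0)
    print("Planned 4-hour outage:", planned)
    print("Outage that ran 2 hours over:", overrun)
```

Even with these modest made-up numbers, two extra hours of downtime adds thousands of dollars before any delayed projects are counted.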

Thus, there are a number of things to keep in mind that would not matter in a non-production environment – an environment where a system outage means you don’t get to play Marble Blast Gold for a couple of hours. The details of planning an outage of a production system can make up an entire book, but here are some things to keep in mind:

  1. Communicate benefits. Set up meetings and show the users the benefit they will receive from the system outage. Don’t tell them that they’ll get more memory: tell them the system will be faster and respond quicker. Don’t tell them there’ll be more disk space: tell them they’ll be able to store more data.
  2. Plan for failure. Think of this as disaster recovery planning for the project – or as welcoming Murphy into your plans. Anything that can go wrong will – so plan ahead as to what you will do when it does.
  3. Minimize downtime. Every way you can minimize downtime means cost savings to the company – cost savings that they can pass on to you or the customer. It also makes the higher-ups happy, which is always a good thing.
  4. Test – then test some more – then test again. Make trial runs and see if it works. Make detailed plans of what to do and what might happen. Test to see if things worked properly – then test again.
  5. Make backups! Back up the system just before the major change (just in case) and then back it up again just afterwards. Set aside these tapes – and in the meantime, keep the regular daily backup rotation going. Then if you have to roll back, you can.
  6. Make checklists. Sure, you didn’t miss anything the first time – but what about the second time? Can you replicate every step all the way through, without missing any and without doing anything different? Did you test everything or did you miss one? Make checklists – as David Allen would say, “Get it out of your mind!” (he’s right).
  7. Organize a schedule. When will the system be down? Let everybody know and discuss how long. Agree on a specific day and time.
  8. Decide on a pass-fail point. This could be thought of as a “point of no return” – if things are not going well or are not going according to schedule, what is the last moment (or last step) that you can successfully turn back and restore services as planned? Have one of these and stick to it. When that point is reached, determine whether there is room for error and whether everything is going well – or whether you must turn back.
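
The checklist and the pass-fail point fit together naturally, and one way to picture that is the Python sketch below. It is only an illustration under my own assumptions: each step on the checklist has a way to undo it, the pass-fail point is the number of the first step that cannot be undone, and a go/no-go check is made just before that step. The names `Step`, `run_outage`, and `go_check` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    run: Callable[[], bool]   # returns True if the step succeeded
    undo: Callable[[], None]  # reverses the step


def roll_back(done: List[Step]) -> None:
    """Undo the completed steps, most recent first."""
    for step in reversed(done):
        step.undo()


def run_outage(steps: List[Step], point_of_no_return: int,
               go_check: Callable[[], bool]) -> str:
    """Work through the checklist, turning back only while that is still possible.

    point_of_no_return is the index of the first step that cannot be undone.
    go_check is asked once, just before that step, whether to carry on.
    """
    done: List[Step] = []
    for index, step in enumerate(steps):
        if index == point_of_no_return and not go_check():
            roll_back(done)
            return "rolled back"
        if not step.run():
            if index >= point_of_no_return:
                return "fix forward"  # too late to turn back
            roll_back(done)
            return "rolled back"
        done.append(step)
    return "completed"
```

In use, `go_check` would be whatever was agreed on beforehand, such as "are we still on schedule and have all the tests so far passed?" The point is that the decision is made at a step chosen in advance, not improvised in the middle of the night.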