When I worked for a cabinetmaker for a number of months, I asked him about the various tools available and about which ones he used. One thing he said a lot was “that brand is okay, but in a production environment it won’t last.” That is, the cabinetmaking shop would go through the lightweight tools quickly and was different than a home woodworker.
So it is in system administration. When you have users counting on a system to be up, even a planned system outage is going to be extremely unpleasant and costly. Add up the cost of every worker not being productive, projects delayed, overtime paid, and so on. If the planned outage then becomes even longer than expected, you can see the costs begin to add up.
Thus, there are a number of things to keep in mind that would not matter in a non-production environment – an environment where a system outage means you don’t get to play Marble Blast Gold for a couple of hours. The details of planning an outage of a production system can make up an entire book, but here are some things to keep in mind:
- Communicate benefits. Set up meetings and show the users the benefit they will receive from the system outage. Don’t tell them that they’ll get more memory: tell them the system will be faster and respond quicker. Don’t tell them there’ll be more disk space: tell them they’ll be able to store more data.
- Plan for failure. Think of this as disaster recovery planning for the project – or as welcoming Murphy into your plans. Anything that can go wrong will – so plan ahead as to what you will do when it does.
- Minimize downtime. In whatever ways you can minimize downtime means cost savings to the company – cost savings that they can pass on to your or the customer. It also makes the higher-ups happy, which is always a good thing.
- Test – then test some more – then test again. Make trial runs and see if it works. Make detailed plans of what to do and what might happen. Test to see if things worked properly – then test again.
- Make backups! Backup the system just before the major change (just in case) and then back it up again just afterwards. Set aside these tapes – and in the meantime, keep the regular daily backup rotation going. Then if you have to roll back, you can.
- Make checklists. Sure, you didn’t miss anything the first time – but what about the second time? Can you replicate every step all the way through, without missing any and without doing anything different? Did you test everything or did you miss one? Make checklists – as David Allen would say, “Get it out of your mind!” (he’s right).
- Organize a schedule. When will the system be down? Let everybody know and discuss how long. Agree on a specific day and time.
- Decide on a pass-fail point. This could be thought of as a “point of no return” – if things are not going well or are not going according to schedule, what is the last moment (or last step) that you can successfully turn back and restore services as planned? Have one of these and stick to it. When that point is reached, determine whether there is room for error and whether everything is going well – or whether you must turn back.