Making the Case for IPMI

IPMI lets you manage a system remotely when you would not otherwise be able to reach it. However, you may have to convince others that it will be useful. As obvious as its value is to a professional system administrator, there are others who will not see the usefulness of IPMI. This process, making the case for a technology, comes up more often than most system administrators might realize.

There are numerous situations that require a person to be physically present at a machine: entering the BIOS setup, accessing the UNIX/Linux maintenance shell, changing boot devices, and so forth. So how do you show how beneficial implementing full IPMI support would be?

Provide a business case – and perhaps an informal user story – to show how IPMI can reduce the need for a person to actually be present.

To really make the case for IPMI, compute the actual costs of making a trip to the data center: the hourly cost of the administrator, the driving costs in gasoline (both to and from!), and the cost of handling the expense report. Then note the costs that are harder to quantify: an administrator unavailable for other tasks during the four-hour round trip, projects delayed, and problems left unresolved. Combine this with a user story.

For example, create a user story around a possible kernel panic. A user story requires an actual user – an individual – whose story is followed. Here is our example continued as a user story:

Alma received an email alert that a system (db20) was unresponsive. Checking the system, she found no response from the network at all, and the KVM showed a kernel panic on the display. The system accepted no keystrokes, and there was no way to power-cycle it remotely.

So Alma sent an email stating that she would be unavailable for the rest of the day, and called her babysitter to come and take care of her three children for the evening. Then she got into her car, drove the two hours to the data center, and parked in a lot ($10 for one hour). She power-cycled the machine by pressing its front-panel power button and checked the system response using her laptop. She found that the server was responding and logged in.

Then Alma checked the server: the logs showed no problems restarting, the system had restarted cleanly, the subsystems were all running, and the monitoring system showed all systems were good.

Alma left the data center and drove the two hours back home.

If Alma is paid $60,000 yearly (roughly US$28.85 an hour over a 2,080-hour work year), the five hours she spent on this event cost US$144.23. If she drove 320 miles round trip at US$0.76 a mile, she gets US$243.20 as an expense, in addition to the US$10 in parking fees. This makes a total direct cost of US$397.43.

If something like this happens six times a year, then the total yearly cost is US$2384.58, and total downtime for the server is 24 hours, for an uptime rate of 99.73%.
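As a rough sketch of that arithmetic (using the figures from the story, plus an assumed 2,080-hour work year: five hours of Alma’s time, 320 miles at US$0.76 a mile, and US$10 in parking), the numbers can be checked with a few lines of awk:

    awk 'BEGIN {
        hourly = 60000 / 2080          # roughly $28.85 an hour
        labor  = hourly * 5            # five hours of the administrator's time
        miles  = 320 * 0.76            # round-trip mileage expense
        trip   = labor + miles + 10    # plus parking
        printf "Per trip: $%.2f   Six trips a year: $%.2f\n", trip, trip * 6
    }'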

This account doesn’t include the indirect costs, such as projects being delayed because Alma was unable to work on them, nor does it include the personal costs involved, such as babysitting and time away from family. It also doesn’t include the time that HR staff spent on yet another expense report, or the costs associated with the server being unavailable for roughly four hours.

On the other hand, when Polly received word that another server in the data center was unresponsive, she also found that the kernel had panicked and that there was no response from the console. She then used a command-line tool to access the baseboard management controller (BMC) through IPMI. With an IPMI command she rebooted the server and watched the response on the KVM. Checking the system over, Polly found that the server had booted cleanly, the subsystems were operational, and all looked good.
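As a sketch of what such a session might look like, assuming the standard ipmitool utility, a BMC reachable at a hypothetical address (bmc-db21), and hypothetical credentials:

    # Check the chassis power state over the network (IPMI over LAN)
    ipmitool -I lanplus -H bmc-db21 -U admin -P secret chassis power status

    # Power-cycle the hung server
    ipmitool -I lanplus -H bmc-db21 -U admin -P secret chassis power cycle

    # Watch the boot over Serial-over-LAN (if SoL is configured on the BMC)
    ipmitool -I lanplus -H bmc-db21 -U admin -P secret sol activate

The exact interface, address, and credentials will vary by vendor and site; the point is that the entire recovery happens from Polly’s desk.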

If Polly is paid the same amount as Alma, and her response took 15 minutes, we get a total cost of US$7.21. Downtime was reduced by over 90% (along with an associated reduction in costs tied to the server being down). If this happens to Polly six times a year, the total yearly cost is US$43.27, and downtime is 1.5 hours, for an uptime rate of 99.98%.

Thus, IPMI and SoL would have saved Alma’s company US$2341.31 per year.

The strongest case can be made if a recent event could have been solved with the technology you are proposing. If you can point to a situation that could have been resolved in ten minutes with the technology, instead of the hours or days it took without it, then its usefulness will be apparent.

With this user story and business case, the case for IPMI should be readily apparent to just about anybody. The case for other technologies can be made in the same way.

Bringing the Network up to (Gigabit) Speed

When looking at increasing network speed in the enterprise, there are a lot of things to consider – and missing any one of them can result in a slowdown in part or all of the network.

It is easy enough to migrate slowly by replacing pieces with equipment that supports all of the relevant standards (such as 10/100/1000 switches). However, such a migration can bog down, leaving old equipment in place and slowing everyone down.

First, determine whether the infrastructure can handle the new equipment. Is the copper cabling of a sufficient grade to handle the increased demands? Gigabit Ethernet over copper generally calls for Cat-5e or better; if the existing cable is not up to it, it will have to be replaced, perhaps with Cat-6 or better, or even fiber if your needs warrant it. Check for undue electrical interference as well; fiber does not pick up the electromagnetic interference that copper does.

After the cabling is ready, check the network infrastructure, all of it: switches, routers, firewalls, and uplinks. It is easy to miss a piece. Also check the capabilities of each device. For example, can the switch handle full gigabit speeds on all ports at once? You might be surprised at the answer: a 24-port gigabit switch needs roughly 48 Gbit/s of switching capacity to run every port at full duplex simultaneously, and not every switch has it.

Once the equipment is in place, make sure that all clients are actually connecting at gigabit speeds. Most switches have indicators that show whether a port is running at a gigabit or not.

Make doubly sure that servers are running at full speed, as a slowdown there will affect everyone who uses that server. This is even more important in the case of firewalls, since every connection passing through them takes the hit.
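As a quick sketch, on a Linux client or server the negotiated link speed can be checked with ethtool (the interface name eth0 is only an example, and root privileges are usually needed):

    # Show the negotiated speed and duplex for the interface;
    # "Speed: 1000Mb/s" and "Duplex: Full" confirm a healthy gigabit link.
    ethtool eth0 | grep -E 'Speed|Duplex'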

Lastly, don’t forget the telco equipment. If the connection to the T1 is still running at 100 megabits, it will slow down Internet access for the entire enterprise.

One more thing: an upgrade such as this is a perfect time to bring more advanced equipment in house. Just be conscious of the corporate budget. In such cases, it also helps to present improvements that the executives can see and experience personally, rather than elusive benefits that only the IT staff will notice.

Good luck in your speed improvement project!

User Experiences with System Updates (Fedora)

There has been a lot of discussion about the direction of Fedora, apparently brought on in part by massive updates being pushed to users of recent Fedora releases. I’ve not used Fedora for a long time, for various reasons, but this is a good discussion.

Máirín Duffy has a fantastic article (it is a must read!) describing possible user profiles for Fedora Linux, as well as a description of how updates should be done. One thing she mentions that I haven’t seen in Linux before is the idea of an “update bundle”.

Bundled updates are used in many UNIX environments, including HP-UX and Solaris. In my experience with HP-UX, all of the patches are available separately, but there are also patch bundles that are put through strenuous QA as a unit. With each patch run through QA individually as well as part of the bundle, this leads to better stability for the environment.
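As a small aside, on HP-UX the bundles installed on a host can be listed with the SD-UX tools; a minimal sketch, assuming the standard tools are present:

    # List the software bundles currently installed on this system
    swlist -l bundle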

Also in HP-UX, each patch is posted with a “patch readiness level” that indicates how much QA the patch has actually seen so far. Thus, you can load the patch early if you want, or wait until its rating (given as a star rating) rises, indicating a lower likelihood that the patch will break the system.

Most Linux distributions, by contrast, put the release itself through QA as a complete system, but each updated package afterwards is tested and sent through QA individually rather than as part of a tested bundle.

From all of the discussion I have read, it seems that a new spin or distribution is almost necessary, one that would capture the essence of Fedora with the stability found in Debian Stable or Ubuntu LTS. The camps seem to be split between two sorts of people:

  • People who want the latest versions of everything all the time, no matter how many updates are needed.
  • People who want a system that is stable and doesn’t change.

If the current leadership is unwilling to accommodate those who want stability and reliability, then those users will go somewhere else to find it. That would be a dramatic loss to the Fedora community, in my opinion.

Munich’s Migration to Linux: Troubles

The city of Munich chose to migrate to Linux in 2003 because Microsoft would no longer support the software the city was using. Recently the head of the migration project (called LiMux), Florian Schießl, reported on why the project was taking longer than originally expected. An article on H Online describes the situation well.

The problems were, perhaps, avoidable: suitable planning and incremental testing could have headed off much of what Munich has experienced so far.

One problem was with proprietary servers that do not work well with open source clients; one in particular that Florian mentioned was DHCP. A proprietary DHCP server would hand out leases that were incompatible with the Linux clients.

Another problem was with applications that needed to run on the client. For example, some applications required ActiveX, which is unavailable on Linux. Another example was a heavy dependence on Visual Basic for Applications (VBA) macros in Microsoft Word.

With a plan revised in 2007, the migration has gone much more smoothly and rapidly. Departmental migrations now begin with a pilot project, and Munich’s computing infrastructure is being overhauled as well.

Munich’s experience can be a lesson to the rest of us. What can we learn?

  • Test with infrastructure before rolling out.
  • Test with internal applications before rolling out.
  • Update or migrate infrastructure if needed.
  • Use a pilot migration to work bugs out before a full migration.

UPDATE: USA Today had an excellent article in 2003 describing the initial process whereby Linux was chosen for Munich – and Microsoft’s drastic measures to try to beat out Linux.

Building a Checklist

When you are undertaking an invasive and complicated process, you should have a checklist to go by. This will help you make sure you cover all the bases and don’t forget anything. I’ve written about this before.

However, how do you build a checklist that will be of the most assistance?

First, “build” is the right term: in the days or weeks leading up to your process (system maintenance, for example), come back to the checklist over and over. Review it several days in a row, or better yet, several times a day. You’ll think of new things to add to it, and you’ll flesh it out until it is comprehensive and complete. You might want to leave it open on your workstation so you can come back to it whenever the mood strikes.

Secondly, break the checklist down into major sections. For example, in patching a system you might have sections for: 1) preparing the system; 2) patching the system; 3) rebooting the system. Other processes will have different major sections. These major sections should be set apart on your checklist, preferably with titles and bars that segregate the checklist into its component parts. I recommend a different color background and a large bold font to set it apart.

Thirdly, there should be a “point of no return”, which should fall at a major section break. This is the point beyond which you cannot turn back and return to the way things were. When you reach it, you have to choose: have things gone smoothly enough that completion is likely, even inevitable, or is the process in such disorder and disarray that a return to the status quo would be better? Decide one way or the other before you pass that point.
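As a purely hypothetical sketch, a checklist for patching a system along the lines described above might be laid out like this (the steps shown are only placeholders):

    ==== 1. PREPARING THE SYSTEM ====
    [ ] Announce the maintenance window to users
    [ ] Verify the most recent backup completed successfully
    [ ] Record current kernel and package versions

    ==== 2. PATCHING THE SYSTEM ====
    [ ] Stage the patches on the system
    [ ] Apply the patches
    [ ] Review the logs for errors

    *** POINT OF NO RETURN: if patching has not completed cleanly,
        restore from backup and reschedule ***

    ==== 3. REBOOTING THE SYSTEM ====
    [ ] Reboot the system
    [ ] Verify that all subsystems are running
    [ ] Confirm the monitoring system shows all green
    [ ] Notify users that maintenance is complete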

With such a checklist, your process will be much smoother, and you won’t have to explain to the boss why you missed something critical. It’ll also document what you did (along with the notes you take).

Planning for a system outage: 8 ways to make it easy

When I worked for a cabinetmaker for a number of months, I asked him about the various tools available and which ones he used. One thing he said a lot was “that brand is okay, but in a production environment it won’t last.” That is, the cabinetmaking shop would go through lightweight tools quickly; its needs were different from those of a home woodworker.

So it is in system administration. When you have users counting on a system to be up, even a planned system outage is going to be extremely unpleasant and costly. Add up the cost of every worker not being productive, projects delayed, overtime paid, and so on. If the planned outage then becomes even longer than expected, you can see the costs begin to add up.

Thus, there are a number of things to keep in mind that would not matter in a non-production environment – an environment where a system outage means you don’t get to play Marble Blast Gold for a couple of hours. The details of planning an outage of a production system can make up an entire book, but here are some things to keep in mind:

  1. Communicate benefits. Set up meetings and show the users the benefit they will receive from the system outage. Don’t tell them that they’ll get more memory: tell them the system will be faster and respond quicker. Don’t tell them there’ll be more disk space: tell them they’ll be able to store more data.
  2. Plan for failure. Think of this as disaster recovery planning for the project – or as welcoming Murphy into your plans. Anything that can go wrong will – so plan ahead as to what you will do when it does.
  3. Minimize downtime. Every way you find to minimize downtime means cost savings for the company, savings that can be passed on to you or to the customer. It also makes the higher-ups happy, which is always a good thing.
  4. Test – then test some more – then test again. Make trial runs and see if it works. Make detailed plans of what to do and what might happen. Test to see if things worked properly – then test again.
  5. Make backups! Back up the system just before the major change (just in case) and then back it up again just afterwards. Set aside these tapes, and in the meantime keep the regular daily backup rotation going. Then, if you have to roll back, you can (a rough sketch of such a backup follows this list).
  6. Make checklists. Sure, you didn’t miss anything the first time – but what about the second time? Can you replicate every step all the way through, without missing any and without doing anything different? Did you test everything or did you miss one? Make checklists – as David Allen would say, “Get it out of your mind!” (he’s right).
  7. Organize a schedule. When will the system be down? Let everybody know and discuss how long. Agree on a specific day and time.
  8. Decide on a pass-fail point. This could be thought of as a “point of no return” – if things are not going well or are not going according to schedule, what is the last moment (or last step) that you can successfully turn back and restore services as planned? Have one of these and stick to it. When that point is reached, determine whether there is room for error and whether everything is going well – or whether you must turn back.
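As a minimal sketch of the pre-change backup in point 5, assuming a Linux host whose configuration and application data live under /etc and /srv/app (both hypothetical paths), something along these lines could be run just before and just after the change:

    # Label each archive with the date and the phase of the work
    # so it is easy to find if a rollback is needed.
    tar -czf /backup/pre-change-$(date +%Y%m%d).tar.gz /etc /srv/app

    # After the change completes, capture the new state as well.
    tar -czf /backup/post-change-$(date +%Y%m%d).tar.gz /etc /srv/app

Whether the archives land on disk or on the tape rotation mentioned above is a site decision; the point is to have a known-good snapshot on either side of the change.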