Statistical Analysis is Valuable for Understanding

In system administration – and many other areas – statistics can help us understand the real meaning hidden in data. There are many places where statistical data can be gathered and analyzed, including sar data and custom-designed scripts written in Perl, Ruby, or Java.

How about the number of processes – when they are started, when they finish, and how much processor time they take over the length of time they run? Programs like HP’s Performance Agent (now included in most HP-UX operating environments) and SGI’s fabulous Performance Co-Pilot can help here. In fact, products like these (and PCP in particular) can gather incredibly valuable data. For example, how much time does each disk spend above a certain amount of writing, and when? How much time does each CPU spend above 80% utilization, and when?

Statistical data from a system could, with the proper programming, be fed into a learning neural network or a Bayesian network, providing a way to raise alarms for statistically unlikely events.
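Even without a neural network, the basic idea can be sketched in a few lines of shell and awk: compute the mean and standard deviation of a metric series, then alarm on any sample more than two standard deviations above the mean. The sample file below is made-up data standing in for values gathered from sar or a custom collection script.

```shell
# Flag statistically unlikely samples (two-pass z-score check).
cat > /tmp/cpu_samples.txt <<'EOF'
12
15
11
14
13
12
95
13
EOF

alarms=$(awk '
  NR == FNR { sum += $1; sumsq += $1 * $1; n++; next }   # pass 1: accumulate moments
  FNR == 1  { mean = sum / n; sd = sqrt(sumsq / n - mean * mean) }
  $1 > mean + 2 * sd { printf "ALARM: sample %d value %s\n", FNR, $1 }
' /tmp/cpu_samples.txt /tmp/cpu_samples.txt)
echo "$alarms"
```

Here the outlier (95) is flagged; everything within two standard deviations of the mean passes silently. A real deployment would feed live sar output in and mail the alarms out.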

There are other areas besides performance where statistical analysis can provide useful data. How about measuring the difference between a standard image and a golden image based on the packages used? How about analyzing the number of users that use a system, when they use it, and for how long? (Side note: I once had a system with 20 or 30 users who each used the system heavily for one or two straight weeks every six months… managing password aging was a nightmare…)
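The who-uses-it-and-for-how-long question reduces to simple aggregation once the session records are extracted. A minimal sketch, assuming the records have been pre-parsed (e.g., from last(1) or process accounting) into made-up "user minutes-connected" lines:

```shell
# Total connect time per user from pre-parsed session records.
cat > /tmp/sessions.txt <<'EOF'
alice 480
bob 30
alice 510
carol 5
bob 45
EOF

totals=$(awk '{ t[$1] += $2 }
              END { for (u in t) printf "%s %d\n", u, t[u] }' /tmp/sessions.txt | sort)
echo "$totals"
```

From totals like these it is easy to spot the heavy burst users – exactly the pattern that made password aging such a nightmare on that system.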

There are many opportunities for analyzing a system and producing statistical data; this capability, however, remains underutilized. So what are you waiting for?

Three Technologies We Wish Were in Linux (and More!)

Recently, an AIX administrator named Jon Buys talked about three tools he wishes were available in Linux. In almost every case, these technologies (not tools, really) are already part of enterprise-class UNIX environments.

One was a tool to create a bootable system recovery disk. AIX calls this tool mksysb; in my world – HP-UX – it is called make_tape_recovery. In HP-UX, this utility allows you to specify which parts of the root volume group (vg00) to save, as well as other volume groups. Booting the tape created by make_tape_recovery allows you to recreate the system – whether as part of a cloning process or as part of a system recovery.
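A typical invocation looks something like the following – note this is a sketch: it must run as root, and the tape device file shown is the conventional default, not necessarily yours.

```shell
# HP-UX/Ignite-UX: write a bootable recovery tape covering the entire
# root volume group. /dev/rmt/0mn is the usual no-rewind tape device.
make_tape_recovery -A -v -a /dev/rmt/0mn
```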

Another technology missing from Linux is the ability to rescan the system buses for new hardware. In Jon’s article, he describes the AIX utility cfgmgr. HP-UX utilizes the tool ioscan to scan for new I/O devices. Jon mentions LVM (which has its roots in HP-UX) but this does not preclude scanning for new devices (as any HP-UX administrator can attest).
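For comparison, here is what a bus rescan looks like on each platform. The Linux side uses the sysfs SCSI rescan interface (present on 2.6-era and later kernels); both commands require root, and the exact host adapters present will vary.

```shell
# HP-UX: probe the buses and list the disk devices found.
ioscan -fnC disk

# Linux: ask each SCSI host adapter to rescan its bus for new devices.
for host in /sys/class/scsi_host/host*; do
    echo "- - -" > "$host/scan"
done
```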

Jon then discusses Spotlight (from MacOS X) and laments that it is missing from Linux. Linux has Beagle and Tracker; both are quite annoying and provide nothing that locate does not – and on top of this, locate is present on AIX, HP-UX, Solaris, and others. I for one would like to completely disable and remove Spotlight from my MacOS X systems – Quicksilver and Launchbar are both better than Spotlight. In any case, none of these tools really belongs on an enterprise-class UNIX system anyway.

As for me, there are some more technologies that are still missing from Linux. One is LVM snapshots: while they exist in Linux, they are more cumbersome. In HP-UX (the model for Linux LVM), a snapshot is created from an empty logical volume at mount time, and the snapshot disappears at dismount. In Linux, the snapshot is created at logical volume creation time (whatever for??) and then is destroyed by a logical volume delete. The snapshot operation should mirror that of HP-UX, which is much simpler.
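To illustrate the Linux lifecycle I'm complaining about, here is a sketch of the usual LVM2 snapshot dance – it requires root, free extents in the volume group, and the volume and mount point names here are made up:

```shell
# Linux LVM2 snapshot lifecycle: create, mount read-only, back up, tear down.
lvcreate --size 1G --snapshot --name lv_data_snap /dev/vg00/lv_data
mount -o ro /dev/vg00/lv_data_snap /mnt/snap
# ... run the backup against /mnt/snap ...
umount /mnt/snap
lvremove -f /dev/vg00/lv_data_snap
```

Four separate administrative steps – where the HP-UX model folds the create and destroy into the mount and dismount.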

Another thing missing from Linux which is present in every HP-UX (enterprise) system is a tool like GlancePlus: a monitoring tool with graphs and alarms (and the alarms include time-related alarms).

Consider an alarm that sends an email when all disks in the system average over 75% busy for five minutes straight. This can be done in HP-UX; not so in a standard Linux install. There are many other examples as well.

Personally, I think that Performance Co-Pilot could fill this need; however, I’m not aware of any enterprise class Linux that includes PCP as part of its standard supported installation. PCP has its roots in IRIX from SGI – enterprise UNIX – and puts GlancePlus to shame.
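To give a flavor of PCP's inference engine, a pmie(1) rule for the disk alarm described above might look roughly like this – the metric name, units, and time-window syntax here are illustrative only and should be checked against the pmie documentation and your local metric namespace:

```
// Evaluate every 30 seconds; alarm when every disk has averaged more
// than 75% busy (avactive rated in ms/s) over the last 10 samples.
delta = 30 sec;
all_inst ( avg_sample disk.dev.avactive @0..9 > 750 )
    -> alarm "all disks over 75% busy for 5 minutes";
```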

Perhaps one of the biggest things missing from Linux – though not specifically related to Linux – is enterprise-class hardware: the standard “PC” platform is not suitable for a corporate data center.

While the hardware will certainly work, it remains unsuitable for serious deployments. Enterprise servers – of all kinds – offer a variety of enhanced abilities that are not present in a PC system. Consider:

  • Hot-swappable hard drives – i.e., hard drives that can be removed and replaced during system operation without affecting the system adversely.
  • Hot-swappable I/O cards during system operation.
  • Cell-based operations – or hardware-based partitioning.

For Linux deployment, the best idea may be to run virtualized Linux servers on enterprise-class UNIX, or Linux on Power from IBM – I don’t know of any other enterprise-class Linux platform (not on Itanium, and not on SPARC) – and Linux on Power may not support many of the enterprise needs listed earlier either.

What are your thoughts?

Preventing Problems (or: How to Appear Omniscient to Your Users!)

When a user comes to you with a problem they are experiencing on one of the servers you manage, what is the first thing that goes through your mind (aside from “How may I help you?”)? For me, there are two: “How can I prevent this from happening again?” and “Why didn’t I know about this already?”

Let us focus on the second of these. If a user is experiencing problems, you should already know – yes, you really should. Whether the server is down, overloaded, or lagging behind, these are the sorts of things you should already know about.

Most servers leave messages in the system syslog or other log files; write or use something that scans the log files for relevant entries and sends you a warning. SEC (the Simple Event Correlator) is one of the best tools for this.
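As an illustration, a minimal SEC rule that mails a warning when a disk error appears in syslog might look like this – the pattern and mail recipient are assumptions to adapt to your environment:

```
type=Single
ptype=RegExp
pattern=kernel: .* I/O error
desc=Kernel I/O error seen in syslog
action=pipe '%s' /usr/bin/mail -s 'disk warning' root
```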

Another tool that is invaluable for this is Nagios, or other monitoring software such as Zabbix or Zenoss. With such software, it is possible to be notified when a particular event occurs or an actual threshold is passed.
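For example, a Nagios service definition that polls a disk check and notifies on warning, critical, and recovery might look like this – the host name and check command are assumptions standing in for your own setup:

```
define service {
    use                     generic-service
    host_name               db01
    service_description     Disk utilization
    check_command           check_nrpe!check_disk
    notification_options    w,c,r
}
```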

When a tool like Nagios is combined with SEC, much more powerful reporting becomes available. For example, if a normally benign error (ugh! who said errors were normal?) occurs too many times within a given period, the error can be reported to the Nagios monitoring software and someone notified.
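In SEC, this count-within-a-window escalation maps onto the SingleWithThreshold rule type. A sketch – the pattern and the script that forwards to Nagios are hypothetical:

```
type=SingleWithThreshold
ptype=RegExp
pattern=app\[\d+\]: transient error
desc=Transient application errors repeating
action=shellcmd /usr/local/bin/report_to_nagios.sh
window=300
thresh=10
```

Here the action fires only if the pattern matches ten or more times inside a 300-second window.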

Other tools provide system monitoring with time-related analysis. For example, if disk utilization is too high for too long, a warning can be issued. Another example: if too many CPUs average more than 60% utilization for the last 30 seconds, someone could be notified.
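The core of such a time-window check is small enough to sketch directly: average the last few samples of a utilization metric and alarm if the average exceeds a threshold. The sample file here is made-up data standing in for readings collected once per interval.

```shell
# Alarm when the average of the last three samples exceeds 60% utilization.
cat > /tmp/util_samples.txt <<'EOF'
40
55
70
72
68
65
EOF

status=$(tail -n 3 /tmp/util_samples.txt | awk '
  { sum += $1; n++ }
  END {
    avg = sum / n
    if (avg > 60) printf "ALARM %.1f\n", avg
    else          printf "OK %.1f\n", avg
  }
')
echo "$status"
```

Real monitoring tools do exactly this, just continuously and across every metric and instance at once.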

HP’s GlancePlus (a part of OpenView which comes bundled with 11i v3) and the now open-source Performance Co-Pilot (or PCP) from SGI are two tools that provide these capabilities. They support averaging, counts per minute, and many other derived measures. PCP comes with support for remote monitoring, so all systems can be monitored (and data archived) in a central location.

Again, these tools can be integrated with SEC or Nagios to send out notifications or post outage notices and so forth.

With tools like these in your arsenal, next time someone comes to you with an outage or sluggish performance complaints, your response can be: “Yes, I’m already working on it.” Your users will think you omniscient!