Unit Testing and System Administration

Unit tests are a programmer’s best friend – they help the programmer fix bugs and keep them fixed, by re-running the same checks to confirm that a bug has not come back.

In administering a system, certain services must be available, and certain products must be installed and configured. Installation tools like HP’s Ignite-UX or Solaris JumpStart can help, as can configuration management products like cfengine.

However, a unit test environment can be of great use to make sure that all went according to plan. Nagios is the best known of these – yes, a unit tester. Consider: do you need NFS? Test for an NFS volume. Do you require a database to be up? Check via TCP that the database server is accepting connections.
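The TCP check described above can be sketched in a few lines. This is only an illustration, not how Nagios itself implements its checks: the function name check_tcp and the host db.example.com are made up for the example. The exit codes follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
import socket

# Exit codes from the Nagios plugin convention.
OK, CRITICAL = 0, 2


def check_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection; return (exit_code, message).

    check_tcp is a hypothetical helper for this sketch, not a Nagios API.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"TCP OK - {host}:{port} accepts connections"
    except OSError as exc:
        return CRITICAL, f"TCP CRITICAL - {host}:{port}: {exc}"
```

A wrapper script would call something like check_tcp("db.example.com", 5432), print the message, and exit with the returned code so that Nagios can read the result.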

In addition, if you’ve bounced a server only to find that you forgot something: create a unit test for it (that is, create a check in Nagios). If an active check won’t work, use a passive check: a check that runs on the server and reports back to Nagios.
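As a rough sketch of what a passive check reports: Nagios accepts passive service results as PROCESS_SERVICE_CHECK_RESULT external commands, one line per result. The helper below only formats such a line. The function name, host name, and service name are assumptions for the example, and how the line is delivered (written to the Nagios command file, or sent through a relay such as NSCA) depends on your installation.

```python
import time


def passive_result(host, service, code, output, now=None):
    """Format one passive service check result as an external command line.

    passive_result is a hypothetical helper; host and service must match
    the names defined in the Nagios configuration.
    """
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{code};{output}\n"
    )
```

For example, a nightly backup job on a host called web01 could finish by submitting passive_result("web01", "backup", 0, "OK - backup finished").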

If you continue to add checks as you think of them or encounter trouble, eventually you will find that you are much more in tune with your servers. Don’t forget to add NagiosGrapher to get the benefit of performance history too. With both Nagios and NagiosGrapher, you’ll be all set.

Automation: Live and Breathe It!

Automation should be second nature to a system administrator. I have a maxim that I try to live by: “If I can tell someone how to do it, I can tell a computer how to do it.” I put this into practice by automating everything I can.

Why is this so important? If you craft every machine by hand, then you wind up with a number of problems (or possible problems):

  • Each machine is independently configured, and each machine is different. No two machines will be alike – which means instead of one machine replicated one hundred times, you’ll have one hundred different machines.
  • Problems that exist on a machine may or may not exist on another – and may or may not get fixed when found. If machine alpha has a problem, how do you know that machine beta or machine charlie don’t have the same problem? How do you know the problem is fixed on all machines? You don’t.
  • How do you know all required software is present? You don’t. It might be present on machine alpha, but not machine delta.
  • How do you know all software is up to date and at the same revision? You don’t. If machine alpha and machine delta both have a particular package installed, maybe it is the same version on both and maybe not.
  • How do you know if you’ve configured two machines in the same way? Maybe you missed a particular configuration requirement – which will only show up later as a problem or service outage.
  • If you have to recover any given machine, how do you know it will be recovered to the same configuration? Often, the configuration may or may not be backed up – so then it has to be recreated. Are the same packages installed? The same set of software? The same patches?
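Several of the questions above come down to comparing one machine against another. As a minimal sketch, assuming you have already collected a package-to-version mapping from each machine (how you gather it depends on the platform's package tools), the differences can be reported like this. The function name and the sample data in the note below are made up for illustration.

```python
def package_drift(reference, other):
    """Compare two {package: version} mappings.

    Returns which packages are missing from `other`, which are extra,
    and which are present on both but at different versions.
    """
    ref_names, other_names = set(reference), set(other)
    mismatched = {
        name: (reference[name], other[name])
        for name in sorted(ref_names & other_names)
        if reference[name] != other[name]
    }
    return {
        "missing": sorted(ref_names - other_names),
        "extra": sorted(other_names - ref_names),
        "mismatched": mismatched,
    }
```

Calling package_drift(alpha, delta) with alpha = {"openssl": "0.9.8g", "perl": "5.8.8"} and delta = {"openssl": "0.9.8e"} reports perl as missing from delta and openssl as mismatched.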

To avoid these problems and more, automation should be a part of every system wherever possible. Automate configuration, setup, reconfiguration, backups, and so forth. Don’t miss anything – and if you do, add the automation as soon as you notice the gap.

Scripting languages like Perl, Tcl, Lua, and Ruby are all good for this.
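To give a flavor of what such automation looks like, here is a small sketch of an idempotent configuration step: it makes sure a line is present in a file and does nothing if it is already there, so it is safe to run on every machine, every time. It is written in Python purely for illustration; the same idea works in any of the languages above. The function name and the example file and setting are assumptions.

```python
from pathlib import Path


def ensure_line(path, line):
    """Make sure `line` appears in the file at `path`.

    Returns True if the file was changed, False if the line was
    already present. Creates the file if it does not exist.
    """
    target = Path(path)
    text = target.read_text() if target.exists() else ""
    if line in text.splitlines():
        return False  # already configured; nothing to do
    # Keep the new entry on its own line even if the file lacks a final newline.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    target.write_text(text + separator + line + "\n")
    return True
```

Running ensure_line("/etc/ntp.conf", "server ntp.example.com") twice changes the file the first time and leaves it alone the second, which is the property that lets one script bring a hundred machines to the same state.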

Other tools that help tremendously in this area are automatic installation tools: Red Hat Kickstart (as well as Spacewalk), Solaris JumpStart, HP’s Ignite-UX, and openSUSE AutoYaST. These systems can, if configured properly, install a machine automatically and unattended.

When combined with a tool like cfengine or Puppet, these automatic installations can be nearly complete – from turning the system on for the very first time to full operation without operator intervention. This automated install not only improves reliability, but can free up hours of your time.

(Not) Installing OpenSolaris 200805 onto a Compaq nc4010

Solaris is by all accounts a great operating system (I continue to think so) but OpenSolaris 200805 on this laptop does not show any of the excellence that Solaris is supposed to have.

I have tried Solaris x86 in the past, including installing Solaris 2.6 onto an aging 486, and installing Solaris 8 onto several different machines, including laptops. None of these installs have had as many problems as installing OpenSolaris 200805 onto this machine. Installing OpenSolaris 200805 into a VirtualBox virtual machine was slick; not so this system. (I still don’t know why a complete install description is required for virtual environments; it’s just another computer system after all.)

First, I installed OpenSolaris to a physical hard drive using the VirtualBox machine to do so. This worked beautifully. Installed, no problem.

However, booting the installed operating system presented a big problem: apparently the root filesystem definition is buried in the filesystem itself (ZFS), so that booting the disk from anywhere else in the system causes the boot to fail. This is not the problem – the problem is trying to find out how to fix it. With Linux, a kernel parameter and a fix to /etc/fstab are all that is needed.

In searching for the answer to this, there were a number of stumbling blocks – obvious ones – and there seemed to be no one who had answered this problem properly:

  • Boot into Failsafe mode and… When I see that, I always wonder what operating system they’re using: OpenSolaris 200805 has no failsafe mode. (Later on, I found out that OpenSolaris 200805 was the first Solaris not to have a failsafe mode… nice.) This is not helpful, and rules out a majority of the responses right off the bat.
  • OpenSolaris 200805 uses ZFS as the root filesystem. This means that a) it is new and not well-tested; and b) most answers to this problem are irrelevant as they are assuming UFS as the root filesystem, not ZFS.

Having had such problems just getting the stupid drive to boot, I gave up: I tried to install directly, using a 3.5″ USB disk caddy with a CD/DVD-ROM drive in it. The system would boot from this, but it was very slow.

The first try resulted in the machine freezing at about 22% done. After rebooting, the system would continually hang right after the initial SunOS boot text. I was able to fix this (after many reboots and freezes) by booting into Linux and overwriting the half-baked install on the internal disk. Thus, the pre-existing data on the internal disk (unused) was enough to cause OpenSolaris to freeze up (I’d used the “entire disk” install option – which presumably wipes the DOS-style partition table clean off the drive).

The second try resulted in a complete install, but that was it. No reboot ever succeeded thereafter. The system froze first at the “zfs0 is …” text, then at “tz0 is …”, then at another one. Trying the option “-B acpi-user-options=0x8” permitted the machine to boot long enough to shut itself off!

About then is when I decided I’d had enough. Maybe Solaris Express or Belenix will work, but OpenSolaris is extremely poor in this department – which is so disappointing. Did I mention that OpenSolaris does not support JumpStart installs either?

With this sort of track record, I cannot recommend OpenSolaris for laptops – nor for production x86 servers. Sad really – I’d been looking forward to getting OpenSolaris on one of my laptops – very much, as a matter of fact.