Automation: Live and Breathe It!

Automation should be second nature to a system administrator. I have a maxim that I try to live by: “If I can tell someone how to do it, I can tell a computer how to do it.” I put this into practice by automating everything I can.

Why is this so important? If you craft every machine by hand, then you wind up with a number of problems (or possible problems):

  • Each machine is configured independently, so each machine is different. No two machines will be alike: instead of one machine replicated one hundred times, you’ll have one hundred different machines.
  • Problems that exist on one machine may or may not exist on another, and may or may not get fixed when found. If machine alpha has a problem, how do you know that machine beta or machine charlie doesn’t have the same problem? How do you know the problem is fixed on all machines? You don’t.
  • How do you know all required software is present? You don’t. It might be present on machine alpha, but not machine delta.
  • How do you know all software is up to date and at the same revision? You don’t. If machine alpha and machine delta both have a particular package installed, it may or may not be the same version.
  • How do you know if you’ve configured two machines in the same way? Maybe you missed a particular configuration requirement, which will only show up later as a problem or service outage.
  • If you have to recover any given machine, how do you know it will be recovered to the same configuration? Often the configuration is not backed up, so it has to be recreated. Are the same packages installed? The same set of software? The same patches?

To avoid these problems and more, automation should be a part of every system wherever possible. Automate the configuration, setup, reconfiguration, backups, and so forth. Don’t miss anything; if you do, add the automation as soon as you know about it.

Scripting languages such as Perl, Tcl, Lua, and Ruby are all well suited to this.
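As a minimal sketch of the idea (the hosts.txt file, the host list, and the sshd setting being enforced are all illustrative), a few lines of shell can apply one configuration change identically to every machine:

#!/bin/sh
# Sketch: enforce one configuration setting identically on every host
# listed in hosts.txt (file name and setting are illustrative).
# Assumes root SSH access to each host.
for host in $(cat hosts.txt); do
    ssh "$host" 'grep -q "^PermitRootLogin no" /etc/ssh/sshd_config ||
        echo "PermitRootLogin no" >> /etc/ssh/sshd_config'
done

Because the script checks before it changes anything, it can be run over and over; that idempotence is what makes automated configuration trustworthy.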

Other tools that help tremendously in this area are automatic installation tools: Red Hat Kickstart (as well as Spacewalk), Solaris JumpStart, HP’s Ignite-UX, and openSUSE AutoYaST. These systems can, if configured properly, install a machine completely unattended.

When combined with a tool like CFEngine or Puppet, these automatic installations can be nearly complete: from turning the system on for the very first time to full operation without operator intervention. This automated install not only improves reliability, but can free up hours of your time.
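As a sketch of what the unattended half looks like, here is a skeleton Kickstart file; every value is illustrative, exact directives vary by release, and the %post hook handing the new system to Puppet is just one way to join the two tools:

# ks.cfg - skeleton Kickstart file; all values are illustrative
install
text
lang en_US.UTF-8
keyboard us
timezone America/Chicago
rootpw changeme
bootloader --location=mbr
clearpart --all --initlabel
autopart
reboot

%packages
@base

%post
# hand the freshly installed system over to the configuration tool
/usr/sbin/puppetd --test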

Preventing Problems (or: How to Appear Omniscient to Your Users!)

When a user comes to you with a problem they are experiencing on one of the servers you manage, what is the first thing that goes through your mind (aside from “How may I help you?”)? For me, there are two: “How can I prevent this from happening again?” and “Why didn’t I know about this already?”

Let us focus on the second of these. If a user is experiencing problems, you should already know about them; yes, you really should. A server that is down, overloaded, or lagging behind is exactly the sort of thing you should not be hearing about first from a user.

Most servers leave messages in the system syslog or other log files; write or use something that scans the log files for significant entries and sends you a warning. SEC (the Simple Event Correlator) is one of the best tools for this.
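As a sketch of what a SEC rule looks like (the log pattern, threshold, and mail command are all illustrative), this one sends a single warning when the same event fires five times within a minute:

# Illustrative SEC rule: warn on repeated SSH authentication failures
type=SingleWithThreshold
ptype=RegExp
pattern=sshd\[\d+\]: Failed password for (\S+)
desc=Repeated SSH login failures for user $1
action=pipe 'Repeated SSH login failures for user $1' /usr/bin/mail -s 'SEC alert' root
window=60
thresh=5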

Another invaluable tool for this is Nagios, or other monitoring software such as Zabbix or Zenoss. With such software, it is possible to be notified when a particular event occurs or an actual threshold is passed.

When a tool like Nagios is combined with SEC, much more powerful reporting becomes available. For example, if a normally benign error (ugh! who said errors were normal?) occurs too many times within a given window, the error can be reported to the Nagios monitoring software and someone notified.
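One way this handoff can work (the paths, host name, and service name below are illustrative) is for the SEC action to write a passive check result into the Nagios external command file:

# Illustrative: submit a passive service check result to Nagios
# (return code 1 = WARNING in Nagios terms)
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd
NOW=$(date +%s)
echo "[$NOW] PROCESS_SERVICE_CHECK_RESULT;web01;syslog errors;1;WARNING - benign error repeating" >> "$CMDFILE"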

Other tools provide system monitoring with time-related analysis. For example, if disk utilization stays too high for too long, a warning can be issued. Another example: if the CPUs average more than 60% utilization over the last 30 seconds, someone could be notified.

HP’s GlancePlus (a part of OpenView, which comes bundled with 11i v3) and the now open source Performance Co-Pilot (PCP) from SGI are two tools that provide these capabilities. They support averages, counts per minute, and much more. PCP comes with support for remote monitoring, so all systems can be monitored (and their data archived) in a central location.
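With PCP, for instance, the inference engine pmie can encode the CPU example above. This is a rough sketch only; the interval, threshold, and message are illustrative, and the metric arithmetic may need adjusting for your platform:

// Rough pmie sketch: complain when average CPU utilization across
// all processors exceeds 60% over the sample interval
delta = 30 sec;
( kernel.all.cpu.user + kernel.all.cpu.sys ) / hinv.ncpu > 0.60
    -> syslog "average CPU utilization over 60% for the last 30 seconds";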

Again, these tools can be integrated with SEC or Nagios to send notifications, post outage notices, and so forth.

With tools like these in your arsenal, next time someone comes to you with an outage or sluggish performance complaints, your response can be: “Yes, I’m already working on it.” Your users will think you omniscient!

Sparse Files and Virtual Machines

Sparse files are files that take up less space on disk than their apparent size. How is this possible? Blocks containing nothing but zeros are not stored; they are silently skipped. When the system reads such a block later, it returns a block of zeros, and if data is written into the block, it is then allocated on the physical disk as usual.
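You can see this for yourself with GNU coreutils on most Linux filesystems (the file name is illustrative):

dd if=/dev/zero of=sparse.img bs=1 count=0 seek=1G   # a 1 GB file with no data written
ls -lh sparse.img    # apparent size: 1.0G
du -h sparse.img     # actual disk usage: (nearly) zero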

Sparse files require file system support; if the file system you use doesn’t support sparse files, then you have no recourse but to store every file in the normal fashion. Notable filesystems without support for sparse files are the FAT family (MS-DOS), HPFS (used by OS/2), HFS+ (the Macintosh filesystem), and OCFS v1 (the Oracle Cluster File System). Modern file systems such as NTFS (the Windows NT filesystem), VxFS (the Veritas File System), GFS, XFS, JFS, ext2, ext3, Reiser3, Reiser4, and many others all support sparse files. No word on whether ODS-5 (OpenVMS) supports sparse files.

Sparse file support is valuable for virtual machines: a virtual hard drive will necessarily be a file of significant size, but by its nature it will also contain a lot of empty space.

However, over time, data gets written to the virtual machine’s hard drive and then deallocated. The stale data remains in those blocks, and the blocks remain on disk. These blocks accumulate, and the file expands, filled with data that is no longer used.

The only way to free these unused blocks is to zero them out and copy the file as a sparse file. You might also want to defragment the disk first, if defragmentation is relevant to your virtual machine’s operating system.

First, zero the unused blocks. Typically, this is done with an erasing program suited to the operating system inside the virtual machine. Make sure that the program (a) erases only unused space, and (b) zeroes the data rather than writing random or other wipe patterns.
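On a Linux guest, for example, one common (if crude) approach is to fill the free space with a file of zeros and then delete it; the file name is illustrative:

# Fill free space with zeros, then remove the filler file.
# Note: this temporarily fills the filesystem; avoid doing it on a busy guest.
dd if=/dev/zero of=/zerofill bs=1M    # runs until the filesystem is full
sync
rm -f /zerofill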

Once the program is done wiping, shut down the virtual machine. (Yes, this process means downtime.) Then copy the original file to a new file, using the appropriate flags to make the new copy sparse. On Linux (using GNU cp) one could do this:

cp --sparse=always oldvm newvm
rm -f oldvm
mv newvm oldvm

These steps will replace the old VM file with a new sparse VM file that uses much less space.
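To confirm the new copy really is sparse, compare the apparent size with the blocks actually allocated:

ls -lh oldvm    # apparent size (unchanged)
du -h oldvm     # actual disk usage (should now be much smaller)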

Tips for SSH

There are a couple of very interesting things about SSH that I never knew before, and I think you’ll find them just as interesting. Over at nixshell, there was a pointer to an article (by Nico Golde) about creating port forwarding allocations on the fly over an existing connection.

To do this (from an existing interactive SSH session), use the ~ escape character like so (note that the escape is only recognized at the beginning of a line, so press Enter first if in doubt):

~C

The results can look like this:

$ 
ssh> -h
Commands:
      -L[bind_address:]port:host:hostport    Request local forward
      -R[bind_address:]port:host:hostport    Request remote forward
      -KR[bind_address:]port                 Cancel remote forward

$

In this case, I entered ~C and then the “command” -h for help. To enter a command at this prompt, it is necessary to type ~C each time; after the help screen was printed, I pressed the Enter key and, as you can see, the UNIX prompt returned.
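So, for example, to add a local forward to a running session, one might type ~C and then something like this (the ports and host are illustrative):

~C
ssh> -L 8080:localhost:80
Forwarding port.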

However, there is a lot more. Going further into nion’s blog (the source), there is also a post about using the ControlMaster and ControlPath options to reuse an existing SSH connection for much faster subsequent connections. This sounds exciting and is something I’ll have to investigate myself. Expect to see more on this later.
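From what I have read so far, the basic setup goes in ~/.ssh/config; the socket path here is illustrative:

# Reuse one TCP connection for multiple sessions to the same host
Host *
    ControlMaster auto
    ControlPath ~/.ssh/control-%r@%h:%p

With ControlMaster set to auto, the first connection to a host becomes the master and later connections reuse its socket, skipping the TCP and key exchange setup entirely.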

Free Software Foundation Files Suit Against Cisco

This is incredible news. The behemoth Cisco has apparently not been in compliance with the GPL (in relation to their Linksys routers, for one), and one problem after another seems to have cropped up as the Free Software Foundation (FSF) tried to resolve each one.

Finally, the FSF saw no recourse but to file a lawsuit to get them to resolve all of the issues, and released a press release to that effect. The FSF gives more details in this article. The complaint, filed by the Software Freedom Law Center (which announced the filing on its own site) on behalf of the FSF, is also available.

The news is spreading far and wide: there are already articles in InformationWeek, InternetNews, and NetworkWorld. It’s also on Slashdot, and a Wikipedia page is shaping up nicely already. (Side note: it’ll be interesting to see how gnu.org handles the Slashdot effect… but I digress.)

I can’t wait until the folks at Groklaw get their hands on this; it will be interesting (and I will update with the results when it happens).

Lastly, if you believe in what the FSF has been doing, why not join today?

Helios Linux Attacked as Illegal Enterprise

I saw this article from Ken Starks, the maintainer of the Helios Linux distribution, about a letter he received. It is from a teacher who confiscated a number of live Linux CDs from a student and then accused the Helios maintainer of illegal activities. The teacher’s letter is astounding in its misunderstanding of the true nature of open source.

Setting aside the audacity and ignorance of the teacher for this article… it goes to show that not everyone is as well informed as many of us. The teacher in this case has perhaps never heard of Edubuntu, a distribution built just for education, nor of OLPC, a nonprofit organization trying to get laptops (Linux laptops, mind you) into the hands of children across Africa and the rest of the developing world.

We must be prepared to educate our supervisors, users, and others who rely on us as to why this or that open source project is worthwhile. In many cases, the fact that a product is open source is not a selling point: many folks will not use something merely because it is open source; they would rather pay for something that is better, or meets their needs, or is “what everyone uses.”

Examples of this abound: Linux vs. Windows, Linux vs. UNIX, Red Hat Enterprise vs. CentOS, OpenOffice vs. Microsoft Office, OpenSSH vs. SSH, GnuCash vs. Quicken, and more. Put aside the open source nature of the product and explain why it is better than the commercial product. Does it have more features? Does it work in more places? Is it easier to use? Does it cost less? (Okay, low cost is not exclusive to the open source movement; freeware offers that too…) Does it have a lighter footprint? Is it more widely used than the commercial product?

All of this must be explained to those who have no idea what open source is about, and who perhaps have no technological background, much less an understanding of technical history.

Let’s get out there with our heads held high and educate the masses!

Update: this story has a happy ending. I’m also glad he didn’t name the teacher involved, and I can just imagine the vitriol that flew his way. The fact that he stood his ground speaks tremendously to his character. Kudos, Ken!

Laptop “Disaster Recovery”

Over at the Productivity501 blog, there is a good article about laptop contingency planning. It is a must read. Go read it!

I’d like to take this one step further. Here in Wisconsin, we are having one back-breaker of a snowstorm (one and a half days so far). Closings everywhere – and people are looking to use the corporate VPN to work from home.

Here are some things to do to prepare for this ahead of time:

  • Make sure your certificate is current. You don’t want to find out that your certificate has expired when you are desperately trying to get in.
  • Have you tried the VPN already? Does it work? When you are buried in snow and can’t reach the help desk is not the time to find out your software doesn’t work.
  • Try accessing everything you need to use. Is it responsive? Does it work? What are the quirks? If it’s slow, you can plan a backup strategy; if it’s not slow, you’ll know it’s not your machine when the VPN slows to a crawl.
  • Try accessing the VPN from where you would be when the snow flies (or wherever you would be when disaster strikes). Some ISPs have restrictive policies that will prevent your laptop from working if you are visiting someone. Try it first and find out how to solve any problems ahead of time.
  • Do you have your laptop with you? It won’t do you any good if you are caught without it when you need it. Do you have charging cords? Network cables? Wireless cards? Cellular phone modems? And test the connections!
  • Create backup plans. Suppose that, despite all your careful planning, your laptop or Internet connection has gone south. Now what? Most likely, you’ll need the phone numbers of your boss and coworkers, pager numbers, and other such things.

With this wintry weather upon us, it will be very important to be ready if you have to do your admin work from home (or on the road).

HP-UX Expanded (Long) Hostnames

Hostnames (without domain names) in UNIX were traditionally limited to eight characters. This limitation remains present in standards-compliant uname(2) implementations.

HP-UX 11i v3 introduced the ability to use hostnames of up to 255 bytes; however, the capability is turned off in the system as installed. HP-UX 11i v3 also introduced two kernel tunables to handle the long hostnames:

  • uname_eoverflow. This boolean parameter determines what happens when uname(2) encounters a long hostname while long hostname support is disabled. If the tunable is true (1), the call fails with EOVERFLOW; if false (0), the hostname is silently truncated before being returned by the uname system call.
  • expanded_node_host_names. This boolean parameter determines whether long host (and node) names are supported. If set to true (1), host and node names can be up to 255 characters (actually, 255 bytes); if set to false (0), the limits are 8 bytes for the node name and 64 bytes for the host name. The actual current host name (or node name) is unaffected, no matter the setting.

To enable the long host name support in HP-UX 11i v3, do thusly:

kctune expanded_node_host_names=1

The change takes effect immediately. Do be acutely aware that programs not expecting extended names may break or react in unexpected ways.
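To double-check the tunables afterward (a sketch; kctune output format varies by release):

kctune expanded_node_host_names uname_eoverflow   # query the current values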