System Reboots Require These Tools and Practices

When a long-running server needs to be rebooted, what are the most important tools? Remember, reboots on many systems can be weeks, months, or even years apart, so a reboot is not a normal occurrence for the machine.

So what would be the best tools to have on hand? Paper and pen. Take extensive notes of everything out of the ordinary that happens as the system comes up – things to fix, things to watch out for, and so on. Recording how long the reboot takes is not a bad idea either. Watch for services that are not required and shut them down as needed.

When debugging the reboot process, make sure to get evidence of a completely clean startup before considering the job done. The job may look like it is done, but if a reboot exposes a failure in configuration or other problems, then it’s not done – and you won’t know unless you reboot.

Also when you reboot, make sure that all subsystems are up and running. Often, important subsystems are not set to automatically start up – in case the system crashes, the idea is to keep the system off-line until the reason for its demise is fully known. So don’t forget these important subsystems and start them up after booting – whether the system is Caché or Oracle or some other.

Living in the Internet Cloud

When we are on-the-go professionals, potentially required to work from home or from other locations on the road, isn’t it good to be able to reach our data no matter where we are?

Hence the interest in being able to “live in the cloud”, keeping data and information on Internet computers out there somewhere.  Unfortunately, it also means that instead of making our own backups, we must rely on someone else’s backups.  Suppose the company goes out of business?  This has already happened to several photo sites – and in one case, the site took its customers’ photos with it.

There are many sites that can provide a safe harbour for data or for information of various kinds.  My favorites are these:

The online desktops Goowy and eyeOS deserve special mention.  Not only do they provide a desktop, but also all the standard applications you might need.  It is possible to run within one of these desktops and save your data entirely with one of these setups.  This makes for a fantastic central location for everything – and a larger-than-normal risk.

The eyeOS desktop has one more feature that most of these do not: it is open source.  If you want to run your own version of eyeOS, there’s no problem doing so.  This is incredibly useful if you have your own server to run this on.  Then you can centralize your information and retain control at the same time.

I also find the mail clients in Goowy and eyeOS to be quite useful for sending mail from anywhere with a browser.

New tools: pkill and pgrep

In Solaris 7, two new utilities were added: pkill and pgrep. These tools are perhaps old news to Solaris admins. However, these utilities were then quietly added to Linux in short order, and now show up in HP-UX 11i v3 and perhaps others. What do these tools do?

First of all, pkill is just a wrapper for pgrep. Well, what does pgrep do? It searches the current processes for a match based on arguments that you give to it.

The most common use would be to search for a command. The pgrep utility will then return each matching process ID, one per line. The pkill utility will instead send a signal (SIGTERM by default) to each matching process. The pgrep utility can search based on a large array of factors, including userid, groupid, virtual size, effective user ids, command name, full command string, and more.

Here’s an example:

# pgrep cache
13
12
14
15
33
49
22006
21950
21973
21976
#

When combined with scripting, these commands can be quite useful. Consider this ksh snippet:

for pid in $(pgrep cache) ; do
  # replace this with the commands to run for each process ID
  ps -p "$pid"
done
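
As a rough sketch of searching on more than the command name, the following starts a throwaway sleep process, finds it with pgrep restricted to children of the current shell (-P) and to an exact name match (-x), then signals it with pkill. The sleep process and the variable names are only for illustration, and the supported options vary somewhat between platforms, so check the pgrep man page on yours.

```shell
# Start a disposable background process to search for (illustration only).
sleep 300 &
bg_pid=$!
sleep 1   # give the child a moment to exec before searching

# -P limits the match to children of this shell; -x requires an exact name match.
found=$(pgrep -P $$ -x sleep)
echo "pgrep found: $found (started: $bg_pid)"

# pkill accepts the same selection options; -TERM names the signal explicitly.
pkill -TERM -P $$ -x sleep
wait "$bg_pid" 2>/dev/null || true
```

Restricting the match with -P keeps pkill from signalling unrelated processes that happen to share the same name, which is the main hazard of killing by name.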

Confidentiality

It’s easy to be casual about confidentiality and to miss the finer aspects of what a system administrator needs to be quiet about in daily dealings with users and customers. There are obvious confidentiality agreements and the usual corporate trade secrets and new products – but there are other things that we as admins must be wary of.

The obvious corporate secrets include things such as trade secrets, secret formulas, research, new product development, and more. Perhaps almost as obvious are customer data and patient information (in healthcare related industries).

However, what about the server break-in? Certainly anyone affected will have to be told – but how much? And what can be told around the water cooler?

A server compromise is a perfect example of an area where we as admins should know a lot more than we tell. Telling all of the details could alert the cracker to what is known and not known, and could potentially compromise any future legal action against the cracker. Talking could also affect any public relations that the company may wish to do in the wake of a serious event like this.

Other items include user privacy. Certainly users expect a certain measure of privacy – perhaps too much given the realities of administration today. However, what does an email administrator do when they find out that someone is cheating on their spouse? Answer: nothing. What do you say to others? Nothing. A lot of things fall into this category.

However, what if you as an admin find something that demands action – perhaps legal action? Pornography on a system, for instance – or hate mail – or spam being sent? Best thing is to send it up the chain of command in the most secure way: face to face – on a walk in the clear air if necessary. Document (on paper) as much as possible, and pass that up the chain of command as well – perhaps keeping a second copy just in case. Sign and date both. And after all this, what do you tell your coworkers? Nothing. Absolutely nothing. Also, never ever identify the person until all action is done – and perhaps not even then.

What about the new fancy security feature that got installed? It doesn’t matter if it is digital, physical, or otherwise: say as little as possible. When working at a bank, one learns this first-hand and quickly; data center security and server security are the same way. Of course, among your cohorts in the trenches, security is a shared topic – but it is not for public (or staff) consumption.

It is better to err on the side of caution and silence than to say too much – and in this digital age, anything in email or on disk is already too much and can be recovered. To maintain the strictest confidentiality, don’t use digital means to talk about confidential matters.

The Landscape of Virus Writers

Initially, the virus writers among the programmers and hackers were hobbyists – or those engaged in research (though perhaps misguided or misapplied). Sometimes – or perhaps all the time – viruses would escape from their hosts and get sent into the wild.

One of the oldest had to do with the original game of Adventure: so many people wanted to play it, that it was decided in the local environment to automatically replicate Adventure on the user’s local machine before running. It was so wildly successful that it was everywhere – and then the counter-virus was written that would delete a copy of Adventure if the user no longer wanted it. (I’ll be durned if I can actually back up that story…. my memory doesn’t go back that far…)

Now it has been reported by The Register (and noted by darknet.org.uk and TrendLabs Malware Blog) that the virus writing club 29A is disbanding. Most virus writing groups of the past have been the equivalent of spray painters painting a building – or those that try to see how many places they can go in a building (building hackers?). Money was not the objective – prestige, honor, and popularity were all part of it.

Now with the demise of 29A, and the newly reported fact that adware has surpassed viruses as the largest current threat, it is becoming clear that the typical virus writer is changing – becoming more interested in profit and extortion.

Is this a fact worth bemoaning? Before, virus writers just wanted to wipe out a system – or propagate the virus as widely as possible. Now writers want to put the system into a botnet or to extort money from the owner. Which is better?

The hacker ethic states that you do no damage to a system. The earliest virus writers did their best to follow this – but virus writers haven’t followed that rule for many years.

It doesn’t make one sleep any easier at night knowing viruses are now the domain of the extortionist and not the spraypainter….

All I know is, I’ll spend whatever time and effort I can to keep them out. A production system cannot go down due to a virus, no matter if it is malignant or not.

Text Editors

Over at Hoff’s Blog, he recently discussed some thoughts he had about text editors. However, it was in the comments where I learned why TECO is so unstable on OpenVMS for Itanium. You wouldn’t think that anything in a shipping product would regularly do the equivalent of dump core – not to mention something as old and tested as TECO – but it’s true. Apparently TECO (being the old and unkillable beast that it is) wasn’t rewritten or debugged for Itanium, but rather was put on top of an emulation layer and translation layer, which seems to cause stability issues.

Thus, if you want a stable TECO (and who doesn’t want a stable editor?) you are better off getting a version of TECO-C and compiling it or installing it onto OpenVMS rather than using EDIT/TECO.

Believe it or not – I learned more about TECO before I learned about OpenVMS. And even among OpenVMS people, the TECO users must seem to be antiquarians or eccentrics! I just like the idea of programming my text editor…. (so I’m a little strange…)

Of course, vim and other more current editors are also available – and there are EDT clones for other machines if your inclination leans that way.

PS: Talking of editors – do any of you remember a text editor called NEAT? I seem to remember liking it, but I haven’t seen it since the Univac 1100/80 was decommissioned at the local university (a long time ago). And have you tried googling for a NEAT editor? (ouch!)

Writing HP-UX init scripts (and a tip!)

HP-UX has some nice features in their initialization scripts, but you have to be aware of them in order to take advantage of them.

One good starting point is the rc(1m) man page. The rc program is the actual program that does all of what we call initialization: that is, all of the startup processes, the rc.d directories, the symbolic links, the run levels – rc does all this. It is rc that init(1m) calls to make it happen.

Also, the init scripts are not where you would first look – or second. The scripts live in /sbin/init.d. This directory contains all of the scripts (and no links). Then look at one of the scripts in order to determine what can be done with your own scripts – and to see if there are any new features.

There are several features of the HP-UX init script mechanisms that you can take advantage of:

  • The /etc/rc.config.d directory – to set configurable parameters for each individual subsystem (usually as represented by a script)
  • The startup and shutdown messages: these are formatted nicely and make for quick viewing of the startup or shutdown process
  • Results: not just SUCCEED or FAIL but also N/A (Not Applicable) and others.

To use the /etc/rc.config.d directory, just source the appropriate configuration “script” into your environment. Then the appropriate variables will be set.

The startup and shutdown messages are exquisitely simple: instead of start and stop, the routine is called with start_msg and stop_msg. Respond to these subcommands by printing the appropriate text for a descriptive message. With these set, your init script will display its appropriate message when it is started or stopped during startup or shutdown.

Then there are the results of the script, which show up at the far right-hand side of the line during HP-UX startup or shutdown. The script reports its result through its return value; the possible values are:

  • (0) Subsystem was brought up (or down) successfully.
  • (1) Subsystem encountered errors.
  • (2) Subsystem was skipped, such as from configuration in the /etc/rc.config.d file or other reasons, and did nothing.
  • (3) A reboot of the system is requested; rc(1m) will perform the actual reboot. /etc/rc.bootmsg will be displayed on the console prior to reboot, then deleted.
  • (4) Subsystem was successful and started a process in the background.

Any return value larger than 4 is the same as 1: subsystem encountered errors.
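
Putting these pieces together, an init script might look roughly like the following sketch. The subsystem name myapp, the MYAPP_ENABLED, MYAPP_START_CMD, and MYAPP_STOP_CMD variables, and the RC_CONFIG_DIR override are all hypothetical and exist only for illustration; the existing scripts in /sbin/init.d remain the authoritative model.

```shell
# Sketch of an HP-UX-style init script for a hypothetical subsystem "myapp".
# On HP-UX the first line of the real script would be: #!/sbin/sh
# RC_CONFIG_DIR is an illustrative override; the real directory is /etc/rc.config.d.
RC_CONFIG_DIR=${RC_CONFIG_DIR:-/etc/rc.config.d}

myapp_rc() {
    case "$1" in
    start_msg) echo "Starting the myapp subsystem" ;;
    stop_msg)  echo "Stopping the myapp subsystem" ;;
    start|stop)
        # Source the configuration "script" so that its variables are set here.
        if [ -f "$RC_CONFIG_DIR/myapp" ]; then
            . "$RC_CONFIG_DIR/myapp"
        fi
        # Hypothetical control variable: anything other than 1 means "skipped" (2).
        [ "${MYAPP_ENABLED:-0}" = 1 ] || return 2
        if [ "$1" = start ]; then
            ${MYAPP_START_CMD:-true} || return 1   # 1: subsystem encountered errors
        else
            ${MYAPP_STOP_CMD:-true} || return 1
        fi
        return 0                                   # 0: brought up (or down) successfully
        ;;
    *)
        echo "usage: $0 {start|stop|start_msg|stop_msg}" >&2
        return 1
        ;;
    esac
}

# A real script would finish by dispatching on its first argument:
#   myapp_rc "$1"; exit $?
```

Keeping the logic in a function makes it easy to try out by hand (for example, with RC_CONFIG_DIR pointed at a scratch directory) before the script is linked into the rc directories.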

Note, too – there is nothing that mandates that these must be shell scripts (except perhaps for rc.config.d file syntax). If you must, it could be a perl script – although a shell script is best. Just make sure to have the proper header so that the system doesn’t try to interpret your perl script as a shell script.

And a quickie tip for you! If you find yourself using a shell and are expecting to edit the history using vi – only to find that there is no history – you can find that your display no longer responds (!). In reality, the system is responding just fine, except you can’t see it (small things, right?). This is because the editing sequence for vi is ^[k (ESCape-K). This mucks up the display as it is not being handled by the shell but rather by the terminal display.

How to get out? I use the following command sequence: first, do a set command – then another. How does this work? Well, I can’t explain everything, but the first output from set will contain the ^[k sequence in it (look in the variable $_). My theory is that that is not the only place to find it. It also does not seem to matter what the second command is, just one with output (and not just a carriage return).

Anyone who feels the need for a challenge can research this – I just know it works, and can be typed very fast.

Setting User Expectations (or the Scotty Factor)

When a user is affected by something we do, it is easy to miss the mark and let the user have unrealistic expectations without them or us realizing it. When this happens, the expectations are inevitably dashed and the user (for better or worse) blames the administrator (and not entirely without cause).

A good example is response time. When a new server is brought in, and the user application is moved from the old server to the new, it is easy for a user to think that all their response time troubles will be solved, things will be extremely fast, and there will be no waiting.

It is up to us as administrators (and customer service personnel!) to educate the user that some problems may be solved, and the exact response won’t be known until it runs on the new hardware. We know better than anybody how a slow disk could bring down the fast server, or an application bug could surface that was hidden or nonexistent on the old hardware, or the new hardware is not completely supported by the application – there are any number of things that can affect response time.

Another is server downtime. When downtime is needed, it is not enough for us to post it to the central web page that details all of the outages. You know that not every user will read it – and maybe not even most. Let the users know that downtime is to be expected, and let them know that it will take this many hours (take your best estimate, then triple it).

This brings us to the Scotty Factor. The Wikipedia article on Montgomery Scott explains it fully:

The term ‘Scotty factor’ describes the practice of over-estimating how much time a project will require to complete by multiplying the actual estimate by a particular number. In strict terms it is a factor of four: the number cited by Scotty in the film Star Trek III: The Search for Spock.

So next time you can see your customers (users) building expectations – make sure that they build the right expectations.

Syslog: “could not change the compartment” (HP-UX)

There is a recurring error that puts a message into the syslog.log file every time a cron job runs under HP-UX 11i v3. This message can safely be ignored, but if you run a lot of cron jobs it can keep filling your syslog with useless messages. The message looks like this:

Feb 28 16:26:00 mysys syslog: Cron daemon - could not change the compartment for the job /usr/bin/foo

These messages will continue to appear for every cron job. They refer to a product (the “compartments” of which the message speaks) which in this case is not installed on the machine.

There is a patch available from HP to fix this, but even after a year this patch remains off of the recommended list. The patch is PHCO_36539 and it will take care of the problem, returning your cron messages to a sane level.
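
Until the patch is installed, one way to cope is to filter the noise out when reading the log. The sketch below assumes the standard HP-UX log location, /var/adm/syslog/syslog.log, and the function names are made up for illustration; it only hides the messages when viewing and does nothing to stop them from being logged.

```shell
# Print a syslog file without the cron "compartment" noise.
# Usage: syslog_quiet [file]   (function name is illustrative)
syslog_quiet() {
    grep -v 'Cron daemon - could not change the compartment' \
        "${1:-/var/adm/syslog/syslog.log}"
}

# Count how many of the noise lines a log file contains.
syslog_noise_count() {
    grep -c 'Cron daemon - could not change the compartment' \
        "${1:-/var/adm/syslog/syslog.log}"
}
```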

The Boot Process (FreeBSD)

When the system starts, it will sometimes fail to load the kernel – or perhaps there are other adjustments that must be made. It thus pays to know exactly how the system gets to the point of loading the kernel, even before init or the swapper runs.

It can also be a benefit in trying to get the system to use a serial console or to provide splash screens throughout.

In a FreeBSD/x86 system, the process goes something like this:

  1. The system turns on, and the BIOS begins processing and preparing hardware for the boot up. Most Intel machines do not provide a serial port view of the BIOS boot process.
  2. The system loads the first block (the master boot record, or MBR) from disk.
  3. The disk loader then loads the rest of the loader. It is at this point that the first screen appears. This loader may be GRUB or the FreeBSD loader.
  4. The next step is normally to load the FreeBSD boot loader (which is different from the FreeBSD MBR loader). The boot loader provides a FORTH-based environment for modifying the boot sequence, and so on.
  5. The final step is to load the kernel and to start processing. GRUB could load the kernel directly, but using the boot loader provides access to the prompt and the ability to modify the boot process.

GRUB can be configured for a serial port, as can the FreeBSD kernel and loader. Likewise, the splash screens can be set here as well.

It is also possible that any of these items may fail; knowledge of the others will provide for methods of recovery.

If GRUB fails to load the loader (or kernel) properly, it can be adjusted interactively to load the right loader with the right parameters.

If the FreeBSD boot loader does not load the kernel properly (or is misconfigured), it can be adjusted by pressing a key and using the FORTH prompt to manipulate the loaded kernel – including loading a different kernel, changing parameters, loading modules, and so forth.
