Posts filed under 'Debugging'

System Reboots Require These Tools and Practices

When a long-running server needs to be rebooted, what are the most important tools? Remember, on many systems reboots can be weeks, months, or even years apart. So a reboot is not a normal occurrence for the machine.

So what would be the best tools to have on hand? Paper and pen. Take extensive notes of everything out of the ordinary that happens as the system comes up - things to fix, things to watch out for, and so on. Recording how long the boot takes is not a bad idea either. Watch for services that are not required and shut them down as needed.

When debugging the reboot process, make sure to get evidence of a completely clean startup before considering the job done. The job may look like it is done, but if a reboot exposes a failure in configuration or other problems, then it’s not done - and you won’t know unless you reboot.

Also when you reboot, make sure that all subsystems are up and running. Often, important subsystems are not set to automatically start up - in case the system crashes, the idea is to keep the system off-line until the reason for its demise is fully known. So don’t forget these important subsystems and start them up after booting - whether the system is Caché or Oracle or some other.
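The post-boot check can be scripted so that no subsystem is forgotten. The following is only a minimal sketch, assuming pgrep is available; the function name check_running and the process names in the usage line are placeholders, not the real daemon names of any particular product.

```shell
#!/bin/sh
# Report which of the named processes are running.
# Prints "ok: name" or "MISSING: name" for each argument and
# returns non-zero if anything is missing.
check_running() {
    missing=0
    for name in "$@"; do
        # pgrep -x requires an exact match on the process name
        if pgrep -x "$name" >/dev/null 2>&1; then
            echo "ok: $name"
        else
            echo "MISSING: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical usage after a reboot (names are placeholders):
#   check_running sshd crond mydbd
```

Anything reported as MISSING is then a candidate for a manual start, or for a note on that piece of paper.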


4 comments 28 March 2008

Generating a coredump (gcore)

If you wish to examine a runaway program outside of its element, you may choose to use the utility gcore. This utility is found in Solaris, Linux, and HP-UX, and perhaps others. The program syntax is:

gcore [ -o corename ] pid

The pid is the process id of the process to dump core, and the corename is the base of the filename to use for the core dump - the full name is the base name plus a period (“.”) and the process id number. The default base name is “core”.

HP-UX systems will accept multiple process ids instead of just one. Solaris has several additional flags (as well as multiple pids). The additional Solaris flags won’t be covered here.

Once core has been dumped, the program continues operation; it does not stop. Thus, gcore is especially useful for taking a snapshot of a running process.

For example, consider a program with the process id 6674:

gcore 6674

This command generates a core file in the current directory with the name “core.6674”. This file can then be read by the GNU debugger gdb. Solaris also provides the dbx(1), mdb(1), and pstack(1) utilities. HP-UX provides gdb as well as the HP adb(1) utility. Both Solaris and HP-UX provide a core management utility coreadm(1M) - which is a topic for another day.
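The naming rule and the snapshot step can be wrapped up as follows. This is a sketch under assumptions: the helper names core_name and snapshot are made up for illustration, the gcore on the system is assumed to accept the -o option as described above, and /path/to/program is a stand-in for the real executable.

```shell
#!/bin/sh
# Build the file name gcore will write: base name, a period, then the pid.
# $1 = pid, $2 = optional base name (the default base is "core")
core_name() {
    printf '%s.%s\n' "${2:-core}" "$1"
}

# Snapshot a running process and print the name of the resulting core file.
# The process keeps running afterwards.
snapshot() {
    pid=$1
    base=${2:-core}
    gcore -o "$base" "$pid" >/dev/null || return 1
    core_name "$pid" "$base"
}

# Hypothetical usage, with gdb installed:
#   corefile=$(snapshot 6674 /tmp/snap)     # writes /tmp/snap.6674
#   gdb /path/to/program "$corefile"
```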

This article has an excellent description of working with core files in Solaris.


16 January 2008

What to do when the system libraries go away…

You’ve been hacking away at this system (let’s be positive and upbeat and say it’s a test system and not production). Through a slip of the fingers, you move the system libraries out of the way - all of them. Now nothing can find the libraries. Now what? Is everything lost?

Don’t despair! You can do a lot without libraries. Already loaded software has the libraries in memory, so that is okay. This includes the shell, so the shell should be okay.

There may be some statically compiled binaries on the system that don’t require libraries; these can be run. If a scripting language like perl or ruby is statically compiled, then all is well - these languages can do anything, and can replace binaries (temporarily) such as mv, cp, and others. However, since vi is probably not statically linked, you may have to make your edits at the command line (and not in an editor).

Here are some things one can do:

echo *

Through the use of the shell’s filename expansion, this works out to a reasonable imitation of ls (ls -m, in particular). If you have to empty a file (make the contents nothing), use this command:

> file

Nearly every standard utility today is dynamically linked; this means that in situations like these you may be stuck with only what the shell itself provides. Remember that things like cat, ls, mv, cp, vi, rm, ln, and so on are all separate system executables, not part of the shell - and quite possibly dynamically linked.
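A few more stand-ins can be built from the shell alone. The sketch below assumes a shell in which printf, read, and [ are builtins (true of bash and dash, but worth checking on your own shell); the function names sh_ls, sh_cat, and sh_cp are made up for illustration.

```shell
#!/bin/sh
# Stand-ins built only from shell builtins (no external binaries needed).

# ls: one name per line, using filename expansion
sh_ls() {
    for f in "${1:-.}"/*; do
        if [ -e "$f" ]; then
            printf '%s\n' "${f##*/}"
        fi
    done
}

# cat: print a text file line by line; the extra test picks up a
# final line that has no trailing newline
sh_cat() {
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done < "$1"
}

# cp: copy a text file. Not safe for binary data, since a shell
# variable cannot hold NUL bytes.
sh_cp() {
    sh_cat "$1" > "$2"
}
```

These only handle text, so they will not put a moved library back; they are for looking around and patching configuration files until the real tools work again.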

The best thing for a situation like this is to have prepared in advance - have a copy of busybox handy, and possibly a statically compiled perl or ruby (or both). Don’t forget editors - have a copy of e3 or of some other statically compiled editor. Busybox provides the standard utilities in one statically linked binary, and e3 is a tiny (and i386-specific) editor that emulates vi, pico, wordstar, or emacs, depending on the name it is invoked under. Neither busybox nor e3 requires additional libraries.

A good thing to have on hand (also useful after a security breach) is a small CD-ROM of tools, all statically linked for your environment. Such a disk requires no libraries at all - and could have all of these necessary tools and more.

Of course, the best thing is to avoid doing this kind of thing in the first place…


9 January 2008

SystemTap (and DTrace)

SystemTap is one amazing piece of work - it is a programmer-friendly and admin-friendly interface to KProbes (which are included in the Linux 2.6 kernel).  When you compare its capabilities to what has gone before, it is truly amazing.  Here are some of the things you can do:

  • Quantify disk accesses per disk per process (or per user)
  • Quantify the number of context switches that are a result of time outs
  • List all accesses to a particular file and the process that accessed it

This is only the tip of the iceberg. There is a wiki with more details, including “war stories.”  There is a language reference there as well.

There was an excellent article in Red Hat Magazine, “Instrumenting the Linux Kernel with SystemTap” by William Cohen.

One controversy that came up was that the initial impetus for creating SystemTap was to implement something like Sun’s DTrace for Solaris but under the GNU General Public License.  Solaris and DTrace are licensed with Sun’s Common Development and Distribution License (CDDL), which many feel makes DTrace incompatible with the GPL-licensed Linux kernel.

Apparently, the CDDL is also incompatible with the BSD-licensed FreeBSD, as FreeBSD 7.0 will not have DTrace either.  There appear to be some licensing issues.

According to the Wikipedia entry on the CDDL, it was designed to be both GPL-incompatible and BSD-incompatible.  With regard to the GPL, the entry suggests that Sun never clarified why; as to the BSD, Sun did not want Solaris to wind up in proprietary products - which the BSD license allows.

On a brighter note, Eugene Teo was able to get the SystemTap tool to work on the Nokia N800.  The article seems to be behind a wall at LiveJournal; the article is still in Google’s cache.  However, it does require some amazing convolutions:

  • A kprobes-enabled kernel must be installed on the N800
  • The SystemTap programs (like stap) must be installed on the N800
  • Any traces must be cross-compiled on another host
  • The kernel module thus created must be moved to the N800
  • Once the kernel module is in place, then the trace can be done.

So every desired trace requires cross-compilation on a desktop beforehand (sigh)…  Oh, well.

There is even a GUI for SystemTap in the works.


1 comment 4 December 2007

OpenSolaris on a MacBook

OpenSolaris is very interesting, and since the introduction of dtrace and ZFS has enthralled many. I tried to install it onto my HP Compaq E300 laptop (which it was unsuitable for), and then tried to install it onto an HP Compaq 6910p laptop. In the latter case, the networking was unsupported: neither the ethernet nor the wireless driver was included with OpenSolaris Express (Developer Edition).

In any case, I expect I might just be shopping for a laptop in the next year - and it’s nice to see that OpenSolaris does run on the Apple MacBook.  This article goes into detail about how the writer got it to work, and each of the steps that were taken to make it happen.  Paul Mitchell from Sun discusses dual-partitioning a MacBook in this context as well.  Alan Perry (also from Sun) had done the same thing with a Mac Mini, and Paul extended it to the MacBook.  Both entries are detailed and have to do with MacOS X and Solaris dual-booting.

On a different note, check out the graph of library calls from dtrace in this article.  From what I’ve heard of dtrace, it’s the ultimate when it comes to debugging…


22 November 2007

5 reasons to want a core dump!

There are several reasons to want to make the kernel dump core - the central one being there is some kernel or hardware based problem which continues to occur. What happens during a kernel panic (when properly configured) is that the kernel itself “dumps core” and the core can be used after reboot for analysis.

So here are some reasons:

  • Intermittent kernel reboots
  • Hard drive “lockups” (constant access, system frozen)
  • Apparent hardware failures
  • Speed problems in the kernel
  • Kernel panic debugging

All except the last depend on a user (administrator) generated kernel panic with associated kernel dump. Of course, this is hard on filesystems, though Linux at least has the option of performing a “sync” from the same location as the user generated panic.

Most UNIX operating systems have the capability for the administrator to generate a kernel-based core dump. Linux users must have a kernel that supports the Magic SysRq key. Solaris on SPARC is set to go; Solaris on Intel processors requires booting the Solaris kernel with the kmdb kernel module loaded (through parameters and settings in the boot loader).
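On Linux, the same SysRq functions can also be reached without the keyboard by writing a single letter to /proc/sysrq-trigger as root: “s” requests a sync and “c” forces a crash. The sketch below is deliberately a dry run unless told otherwise; the function names and the SYSRQ_FOR_REAL variable are made up for illustration, and whether a dump is actually saved depends on a crash dump mechanism having been configured beforehand.

```shell
#!/bin/sh
# Send one Magic SysRq command letter through /proc/sysrq-trigger.
# By default this only prints what it would do; set SYSRQ_FOR_REAL=yes
# (as root) to really write to the trigger file.
sysrq() {
    key=$1
    if [ "${SYSRQ_FOR_REAL:-no}" = yes ]; then
        printf '%s\n' "$key" > /proc/sysrq-trigger
    else
        printf 'would write %s to /proc/sysrq-trigger\n' "$key"
    fi
}

# Sync the filesystems first, then force the kernel to crash.
sync_then_panic() {
    sysrq s
    sysrq c
}

# Hypothetical usage on a test system:
#   SYSRQ_FOR_REAL=yes sync_then_panic
```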

Applications will also generate core dumps, and a lot of the core dump analysis tools used for applications and the methods used can be useful in analyzing kernel dumps as well. BEA has an excellent (multi-platform) description of creating and analyzing core dumps - even though it is oriented towards their Tuxedo product, it seems still useful.

Sun has an excellent article by Adam Zhang, Core Dump Management on the Solaris OS, that covers both application core dumps and system kernel core dumps.

For HP-UX, there isn’t as much on crash dump analysis, though the whitepaper Debugging Core Files using HP WDB (PDF) may be useful.

I don’t know AIX (nor z/OS) that well myself, but there are some free RedBooks that include core dump analysis as part of the book. There is z/OS Diagnostic Data Collection and Analysis for z/OS (if you just happen to have a mainframe in house) and Problem Solving and Troubleshooting in AIX 5L for AIX.

Likely I’ll be covering some of these tools in depth.  For most versions of UNIX and Linux, there are man pages for core(5).  Some systems offer the commands gcore and savecore as well.  As always, the FreeBSD man pages web page covers HP-UX (HP-UX 11.22), Solaris (Solaris 9), and Red Hat (Red Hat Linux 9) and others as well as FreeBSD. Unfortunately, it appears that other Linux and UNIX versions are not being updated (for whatever reason - space?).


16 November 2007

Help! 11 places to get help.

Where do you go when you don’t know the answer to a problem?  Most admins know a few places - but not many seem to go after all that are available to them.  See which of these you know and use:

  • Local instructions.  Corporate administration teams often have their own documentation, and even if there isn’t any on paper or on any digital media, your coworkers may be able to help.
  • Previous experiences.  If you’ve been recording your technical successes and recording documentation et al, perhaps there is something in there.  If not - well, then, you lose, don’t you?  So start recording today!
  • Books. There are many books about system administration topics that may help.  Certification books are often good for technical materials as well.
  • Google. A search on Google (or your search engine of choice) may turn up something somewhere.
  • Personal Network.  Do you have a friend who is a wizard with this sort of problem?  Ask them.  If you don’t have a friend like this… find one!
  • Vendor Support and Documentation Pages. Often, the documentation and support pages from the manufacturer may include pertinent information.  Many of these will not be found in search engines, but can be found by performing a search at the manufacturer’s web site.  HP has the ITRC; Sun has SunSolve as well as BigAdmin.
  • Vendor Forums. Many vendors (such as Apple and HP) have forums that allow users to help each other. Do not neglect these! They are often searchable as well.
  • Usenet. Most (perhaps all?) systems, especially UNIX and Windows, are represented on Usenet newsgroups.  These can be a source of information, and can be searched (or used) through Google Groups.
  • User Groups.  User groups, whether national or local, can be a nice place to find resources to help.  There are Linux user groups (LUGs), Macintosh user groups, and HP groups such as Encompass.
  • Mailing lists. This is similar to Usenet, but via email.
  • IRC.  Internet Relay Chat provides realtime communication with professionals that may be able to help.

13 November 2007

Listing shared libraries in running processes

The lsof utility is very useful, and can be used to list the shared libraries being used by a running process. It can be important to know if a running process is using a particular library, perhaps for forensic reasons or for library upgrades.

To list all the libraries in a particular process, try this command:

lsof -a -c name +D /usr/lib

This will list all files used by name in /usr/lib. To list all files used by name, just use:

lsof -c name

Alternately, to find all processes using a file (library) in /usr/lib, use this command:

lsof /usr/lib/libname

The -c option specifies the beginning of a name of a process to list. The -a option is used to create a boolean AND set; otherwise, lsof assumes a boolean OR set of options. With the +D option (which scans for files recursively down the directory tree), the first example looks for the process name that also has open files from the /usr/lib directory tree.

Another good use of lsof has to do with finding files that are open but deleted. Such a situation could potentially happen with a shared library if the library was deleted while a process was using it. This could perhaps happen during a library upgrade. Use this command to find such files:

lsof +L1

The +L option selects files by link count; here, any file with fewer than one link (that is, zero links) will be listed. Files with zero links are no longer listed in any directory but are still open and in use by a process. The inode and blocks of such a file remain marked as in use by the filesystem until the last process closes it, but the file has no directory entry and cannot be found by name anywhere.
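The zero-link state is easy to reproduce from the shell, which makes a handy test case for the lsof command above. This is just a small demonstration, assuming mktemp is available; the file contents are arbitrary.

```shell
#!/bin/sh
# Demonstrate a file that is open but has zero links.
tmp=$(mktemp)
echo "still readable" > "$tmp"

exec 3< "$tmp"      # hold the file open on descriptor 3
rm "$tmp"           # remove its only name; the link count drops to zero

if [ ! -e "$tmp" ]; then
    echo "name is gone"
fi

# While descriptor 3 stays open, "lsof +L1" would list this file.
IFS= read -r line <&3
echo "$line"

exec 3<&-           # closing the descriptor lets the filesystem free the blocks
```

The data remains readable through the open descriptor even though no path leads to it any more, which is exactly the situation a process is in when its library is deleted underneath it.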

There is a nice concise article by Joe Barr at Linux.com about what you can do with lsof. Lsof is available for download.


3 comments 12 November 2007

Debugging a Stuck pppd Process

I mentioned previously that on my Mac Mini I am using a cellular connection for my Internet link (instead of dial-up). However, from time to time, the connection would get stuck (after dropping) in the “Disconnecting…” state in the graphical tools. There didn’t seem to be anything I could do to stop it. The system doesn’t have what I usually consider essential tools - ptrace, strace, ltrace. In any case, there is a good chance that all three could be Linux-specific commands, and this system is running Mac OS X 10.4 (Tiger).

Then I remembered gdb. Looking up the processes for pppd I found this:

$ ps auwx | grep ppp[.]*d
root 21475 0.0 0.1 28040 1204 cu. Ss+ 10:24AM 0:00.57 pppd serviceid F31F5F28-9986-489D-88F3-CFA56FF89443 controlled
$
$ sudo gdb -p 21475
Password:
GNU gdb 6.1-20040303 (Apple version gdb-384) (Mon Mar 21 00:05:26 GMT 2005)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "powerpc-apple-darwin".
/private/var/log/21475: No such file or directory.
Attaching to process 21475.
Reading symbols for shared libraries . done
Reading symbols for shared libraries …………………………………………….. done
0x90032084 in wait4 ()
(gdb) step
Single stepping until exit from function wait4,
which has no line number information.
^C
Program received signal SIGINT, Interrupt.
0x90032088 in wait4 ()
(gdb) q
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from process 21475 thread 0xd03.
$

This tells me that the pppd daemon is inside a wait4() function (described in the wait(2) man page). This function is waiting for a child process to complete. So then, the next step is: what is this child process that pppd is waiting on?

$ ps alwwx | grep ppp[.]*d
0 21475 42 0 31 0 552328 1228 - Ss+ cu. 0:00.57 pppd serviceid F31F5F28-9986-489D-88F3-CFA56FF89443 controlled
$ ps alwwx | grep 21475
501 25310 25201 0 31 0 8780 8 - R+ p3 0:00.00 grep 21475
0 21475 42 0 31 0 552328 1228 - Ss+ cu. 0:00.57 pppd serviceid F31F5F28-9986-489D-88F3-CFA56FF89443 controlled
0 25131 21475 0 31 0 27688 740 - S+ cu. 0:00.02 /usr/libexec/CCLEngine -m 1 -l F31F5F28-9986-489D-88F3-CFA56FF89443 -f /Library/Modem Scripts/Nokia 3G Packet RB 460 -v -E -S 5 -L 120 -I Internet Connect -i file://localhost/System/Library/Extensions/PPPSerial.ppp/Contents/Resources/NetworkConnect.icns -C Cancel
$ sudo gdb -p 25131
GNU gdb 6.1-20040303 (Apple version gdb-384) (Mon Mar 21 00:05:26 GMT 2005)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "powerpc-apple-darwin".
/private/var/log/25131: No such file or directory.
Attaching to process 25131.
Reading symbols for shared libraries . done
Reading symbols for shared libraries …….. done
0x90001b04 in ioctl ()
(gdb) s
Single stepping until exit from function ioctl,
which has no line number information.
^C
Program received signal SIGINT, Interrupt.
0x90001b04 in ioctl ()
(gdb) q
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from process 25131 thread 0x20b.
$

So…. the script is stuck in ioctl (described in ioctl(2)). A kill was not sufficient, but a kill -9 stopped it. After this, the graphical tools stopped reporting “Disconnecting…” and a reconnect was possible - and went cleanly.
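The hunt for the child process above can be wrapped in a small function. This is a sketch, assuming a ps that accepts -A and the -o keywords pid, ppid, and command (as the ps on Linux and on recent Mac OS X releases does); the name children_of is made up for illustration.

```shell
#!/bin/sh
# List the direct children of a process: pid, parent pid, command line.
children_of() {
    ps -A -o pid= -o ppid= -o command= |
        awk -v parent="$1" '$2 == parent'
}

# Usage, with the pppd pid from the session above:
#   children_of 21475
```

Because the match is made on the parent pid column alone, a pid that merely appears somewhere in another command line (or the filter itself) does not show up in the result.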

One aside: note the pattern in the first command:

grep ppp[.]*d

The bracket expression [.]* matches zero or more literal periods, so the pattern still matches “pppd” - but it does not match the grep command itself, because the text of the pattern as it appears in the process list (“ppp[.]*d”) has a “[” right after “ppp”. Without such a trick, the grep process would show up in its own results. It is a small thing, but it helps, especially in scripts that grep through ps output. Other patterns are usable here; the key is that the added piece can match the empty string, so the pattern still matches the real process name, while the pattern does not match its own text.
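The effect can be checked without ps at all. The sketch below feeds grep two made-up lines standing in for ps output: one for the daemon and one for the grep command itself. The second function shows a common shorter variant of the same idea, which is not from the session above.

```shell
#!/bin/sh
# Two lines standing in for ps output: the real daemon, and the grep itself.
sample() {
    printf '%s\n' \
        'root 21475 pppd serviceid controlled' \
        'user 25310 grep ppp[.]*d'
}

# The pattern from the post: matches "pppd" but not its own text.
sample | grep 'ppp[.]*d'

# A shorter variant: [p] matches a single "p", so "[p]ppd" matches
# "pppd" but not the literal text "[p]ppd".
sample_short() {
    printf '%s\n' \
        'root 21475 pppd serviceid controlled' \
        'user 25310 grep [p]ppd'
}
sample_short | grep '[p]ppd'
```

Both commands print only the pppd line. Quoting the pattern, as done here, also keeps the shell from treating it as a filename wildcard if a matching file happens to exist in the current directory.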


3 comments 4 August 2007


David Douthitt

David is an experienced UNIX and Linux system administrator, a former Linux distribution maintainer, and author of two books ("Advanced Topics in System Administration" and "GNU Screen: A Comprehensive Manual").
