Generating a coredump (gcore)

If you want to examine a runaway program offline, away from the live system, the gcore utility can help. It is available on Solaris, Linux, and HP-UX, and perhaps others. The syntax is:

gcore [ -o corename ] pid

The pid is the id of the process to dump, and corename is the base of the file name to use for the core dump; the full name is the base name, a period (“.”), and the process id number. The default base name is “core”.

HP-UX systems will accept multiple process ids instead of just one. Solaris has several additional flags (as well as multiple pids). The additional Solaris flags won’t be covered here.

Once core has been dumped, the program continues operation; it does not stop. Thus, gcore is especially useful for taking a snapshot of a running process.

For example, consider a program with the process id 6674:

gcore 6674

This command generates a core file in the current directory with the name “core.6674”. This file can then be read by the GNU debugger, gdb. Solaris also provides the dbx(1), mdb(1), and pstack(1) utilities. HP-UX provides gdb as well as the HP adb(1) utility. Both Solaris and HP-UX provide a core management utility, coreadm(1m), which is a topic for another day.
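As a rough sketch of the snapshot workflow described above, the shell functions below wrap gcore and report the file it is expected to leave behind. The names core_file_name and snapshot_core are made up for illustration, and the “base.pid” naming is simply the behaviour described in this post; check gcore(1) on your own system, since options differ between platforms.

```shell
#!/bin/sh
# Sketch only: core_file_name and snapshot_core are hypothetical helper names.

# Print the file name gcore is expected to write: "<base>.<pid>".
core_file_name() {
    pid=$1
    base=${2:-core}
    printf '%s.%s\n' "$base" "$pid"
}

# Take a snapshot of a running process; the process keeps running afterwards.
snapshot_core() {
    pid=$1
    base=${2:-core}
    gcore -o "$base" "$pid" || return 1
    file=$(core_file_name "$pid" "$base")
    if [ -f "$file" ]; then
        echo "snapshot written to $file"
    else
        echo "gcore ran, but $file was not found" >&2
        return 1
    fi
}

# Example (not run here): snapshot_core 6674
# The result can then be opened with: gdb /path/to/program core.6674
```

Keeping the name calculation separate from the gcore call makes it easy to check where the file should land before dumping a large process.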

This article has an excellent description of working with core files in Solaris.

5 reasons to want a core dump!

There are several reasons to want to make the kernel dump core – the central one being there is some kernel or hardware based problem which continues to occur. What happens during a kernel panic (when properly configured) is that the kernel itself “dumps core” and the core can be used after reboot for analysis.

So here are some reasons:

  • Intermittent kernel reboots
  • Hard drive “lockups” (constant access, system frozen)
  • Apparent hardware failures
  • Speed problems in the kernel
  • Kernel panic debugging

All except the last depend on a user (administrator) generated kernel panic with associated kernel dump. Of course, this is hard on filesystems, though Linux at least has the option of performing a “sync” from the same location as the user generated panic.

Most UNIX operating systems have the capability for the administrator to generate a kernel-based core dump. Linux users must have a kernel that supports the Magic SysRq key. Solaris on SPARC is set to go; Solaris on Intel processors requires booting the Solaris kernel with the kmdb kernel module loaded (through parameters and settings in the boot loader).
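On Linux, whether the Magic SysRq key is available can be checked from the shell before you need it. The sketch below assumes the usual /proc/sys/kernel/sysrq and /proc/sysrq-trigger files are present; describe_sysrq is a made-up helper name, and the trigger commands are left as comments because they really will sync or crash the machine.

```shell
#!/bin/sh
# Sketch only (Linux): describe_sysrq is a hypothetical helper name.

# Interpret a value read from /proc/sys/kernel/sysrq.
describe_sysrq() {
    case $1 in
        0) echo "disabled" ;;
        1) echo "all functions enabled" ;;
        *) echo "bitmask $1 (only some functions enabled)" ;;
    esac
}

if [ -r /proc/sys/kernel/sysrq ]; then
    describe_sysrq "$(cat /proc/sys/kernel/sysrq)"
else
    echo "no /proc/sys/kernel/sysrq here (not Linux, or SysRq not built in)"
fi

# As root, and only on a machine you intend to crash:
#   echo s > /proc/sysrq-trigger   # emergency sync of mounted filesystems
#   echo c > /proc/sysrq-trigger   # crash the kernel; a dump is only saved
#                                  # if a crash dump mechanism is configured
```

Issuing the sync first is the “sync from the same location” mentioned earlier, and it reduces (but does not remove) the risk to filesystems.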

Applications will also generate core dumps, and a lot of the core dump analysis tools used for applications and the methods used can be useful in analyzing kernel dumps as well. BEA has an excellent (multi-platform) description of creating and analyzing core dumps – even though it is oriented towards their Tuxedo product, it seems still useful.

Sun has an excellent article by Adam Zhang, Core Dump Management on the Solaris OS, that covers both application core dumps and system kernel core dumps.

For HP-UX, there isn’t as much on crash dump analysis, though the whitepaper Debugging Core Files using HP WDB (PDF) may be useful.

I don’t know AIX (nor z/OS) that well myself, but there are some free RedBooks that include core dump analysis as part of the book. There is z/OS Diagnostic Data Collection and Analysis for z/OS (if you just happen to have a mainframe in house) and Problem Solving and Troubleshooting in AIX 5L for AIX.

Likely I’ll be covering some of these tools in depth. For most versions of UNIX and Linux, there are man pages for core(5). Some systems offer the commands gcore and savecore as well. As always, the FreeBSD man pages web page covers HP-UX (HP-UX 11.22), Solaris (Solaris 9), Red Hat (Red Hat Linux 9), and others as well as FreeBSD. Unfortunately, it appears that the man pages there for the other Linux and UNIX versions are not being updated (for whatever reason; space, perhaps?).

Debugging a Stuck pppd Process

I mentioned previously that on my Mac Mini I am using a cellular connection for my Internet link (instead of dial-up). However, from time to time, the connection would get stuck (after dropping) in the “Disconnecting…” state in the graphical tools. There didn’t seem to be anything I could do to stop it. The system doesn’t have what I usually consider essential tools – ptrace, strace, ltrace. In any case, there is a good chance that all three could be Linux-specific commands, and this system is running Mac OS X 10.4 (Tiger).

Then I remembered gdb. Looking up the processes for pppd I found this:

$ ps auwx | grep ppp[.]*d
root 21475 0.0 0.1 28040 1204 cu. Ss+ 10:24AM 0:00.57 pppd serviceid F31F5F28-9986-489D-88F3-CFA56FF89443 controlled
$
$ sudo gdb -p 21475
Password:
GNU gdb 6.1-20040303 (Apple version gdb-384) (Mon Mar 21 00:05:26 GMT 2005)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type “show copying” to see the conditions.
There is absolutely no warranty for GDB. Type “show warranty” for details.
This GDB was configured as “powerpc-apple-darwin”.
/private/var/log/21475: No such file or directory.
Attaching to process 21475.
Reading symbols for shared libraries . done
Reading symbols for shared libraries …………………………………………….. done
0x90032084 in wait4 ()
(gdb) step
Single stepping until exit from function wait4,
which has no line number information.
^C
Program received signal SIGINT, Interrupt.
0x90032088 in wait4 ()
(gdb) q
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from process 21475 thread 0xd03.
$

This tells me that the pppd daemon is inside a wait4() function (described in the wait(2) man page). This function is waiting for a child process to complete. So then, the next step is: what is this child process that pppd is waiting on?

$ ps alwwx | grep ppp[.]*d
0 21475 42 0 31 0 552328 1228 – Ss+ cu. 0:00.57 pppd serviceid F31F5F28-9986-489D-88F3-CFA56FF89443 controlled
$ ps alwwx | grep 21475
501 25310 25201 0 31 0 8780 8 – R+ p3 0:00.00 grep 21475
0 21475 42 0 31 0 552328 1228 – Ss+ cu. 0:00.57 pppd serviceid F31F5F28-9986-489D-88F3-CFA56FF89443 controlled
0 25131 21475 0 31 0 27688 740 – S+ cu. 0:00.02 /usr/libexec/CCLEngine -m 1 -l F31F5F28-9986-489D-88F3-CFA56FF89443 -f /Library/Modem Scripts/Nokia 3G Packet RB 460 -v -E -S 5 -L 120 -I Internet Connect -i file://localhost/System/Library/Extensions/PPPSerial.ppp/Contents/Resources/NetworkConnect.icns -C Cancel
$ sudo gdb -p 25131
GNU gdb 6.1-20040303 (Apple version gdb-384) (Mon Mar 21 00:05:26 GMT 2005)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type “show copying” to see the conditions.
There is absolutely no warranty for GDB. Type “show warranty” for details.
This GDB was configured as “powerpc-apple-darwin”.
/private/var/log/25131: No such file or directory.
Attaching to process 25131.
Reading symbols for shared libraries . done
Reading symbols for shared libraries …….. done
0x90001b04 in ioctl ()
(gdb) s
Single stepping until exit from function ioctl,
which has no line number information.
^C
Program received signal SIGINT, Interrupt.
0x90001b04 in ioctl ()
(gdb) q
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from process 25131 thread 0x20b.
$

So the script is stuck in ioctl() (described in ioctl(2)). A plain kill was not sufficient, but a kill -9 stopped it. After this, the graphical tools stopped reporting “Disconnecting…” and a reconnect was possible, and went cleanly.
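Grepping the ps output for the parent's PID worked above, but it also picks up any line that merely contains that number. A slightly more exact way to find the children of a process is to compare the PPID column itself. This is a minimal sketch; children_of is a made-up helper name, and the canned input is shortened from the listing above.

```shell
#!/bin/sh
# Sketch only: children_of is a hypothetical helper name.

# Read "pid ppid command..." lines on stdin; print those whose parent is $1.
children_of() {
    awk -v parent="$1" '$2 == parent'
}

# Live use with POSIX ps options (an empty "name=" suppresses the header line):
#   ps -A -o pid= -o ppid= -o args= | children_of 21475

# Demonstration on canned input:
printf '%s\n' \
    '21475 42 pppd' \
    '25131 21475 /usr/libexec/CCLEngine' \
    '25310 25201 grep 21475' |
    children_of 21475
# prints: 25131 21475 /usr/libexec/CCLEngine
```

Because the comparison is on the second field only, neither the parent itself nor the grep command shows up in the result.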

One aside: note the first line:

grep ppp[.]*d

The [.]* part matches zero or more literal periods, so the pattern still matches “pppd”, but it does not match the grep command's own entry in the ps output (which a plain “pppd” pattern would). It is a small thing, but it can help, especially in scripts that grep through the ps command output. Other patterns are usable here; the key is that the pattern does not match its own text, that the added piece can match nothing (the empty string), and that it does not match anything else present in the output.
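The effect is easy to check without a running pppd. The sketch below builds a two-line imitation of ps output (the daemon plus the grep command that is looking for it) and counts the matches for each pattern; count_matches is a made-up helper name. The patterns are quoted here so that the shell cannot expand them as file name globs before grep sees them.

```shell
#!/bin/sh
# Sketch only: count_matches is a hypothetical helper name.

# Build a fake process listing containing the daemon and "grep <pattern>",
# then count how many of those lines the pattern matches.
count_matches() {
    pattern=$1
    printf '%s\n' \
        'root 21475 pppd serviceid controlled' \
        "user 25310 grep $pattern" |
        grep -c -- "$pattern"
}

count_matches 'pppd'       # prints 2: the daemon and the grep line itself
count_matches 'ppp[.]*d'   # prints 1: only the daemon
count_matches '[p]ppd'     # prints 1: another pattern that cannot match itself
```

The last form, a single character wrapped in brackets, is one of the “other patterns” that satisfies the same conditions.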