Archive for the ‘Debugging’ Category

How much can you find out about a HP-UX process?

31 January 2009 ddouthitt

Knowing how to answer this question is often important. Let’s walk through some examples of what can be done to find out all we can about a particular process.

There are, of course, simple things that can be done. Let’s take midaemon as an example. From the command line, we can find out where it is, what it is, and some description of it:

# type midaemon
midaemon is /opt/perf/bin/midaemon
# what `which midaemon`
/opt/perf/bin/midaemon:
        midaemon       C.04.70.000  10/03/07 HP-UX 11 =*=
# file `which midaemon`
/opt/perf/bin/midaemon: ELF-32 executable object file - IA64
# ldd `which midaemon`
        libpthread.so.1 =>      /usr/lib/hpux32/libpthread.so.1
        libIO.so =>     /opt/perf/lib/hpux32/libIO.so
        libc.so.1 =>    /usr/lib/hpux32/libc.so.1
        libdl.so.1 =>   /usr/lib/hpux32/libdl.so.1
# man midaemon
# cd /sbin/init.d
# grep midaemon *
# cd /etc/rc.config.d
# grep -i midaemon *
# swlist -l file | grep midaemon
  MeasurementInt.MI: /opt/perf/bin/midaemon
  MeasurementInt.MI: /opt/perf/man/man1/midaemon.1
  MeasurementInt.MI-JPN: /opt/perf/man/ja_JP.SJIS/man1/midaemon.1
#

This tells us a lot already: midaemon is part of the performance tools (/opt/perf), is a 32-bit IA64 binary, and belongs to the MeasurementInt package (which even includes a Japanese man page!). The man page explains the program in detail.
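The lookups above can be bundled into a small helper. This is only a minimal sketch for a POSIX shell: the function name describe_binary is made up for illustration, and any tool that is not installed is skipped (what and swlist are found mainly on HP-UX).

```shell
# describe_binary NAME: run whichever of the lookups above are available.
# The function name is hypothetical; missing tools are silently skipped.
describe_binary() {
    bin_path=$(command -v "$1") || { echo "$1: not found in PATH" >&2; return 1; }
    echo "path: $bin_path"
    for tool in what file ldd; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "--- $tool $bin_path"
            "$tool" "$bin_path" 2>&1 || true    # e.g. ldd fails on scripts
        fi
    done
    if command -v swlist >/dev/null 2>&1; then
        echo "--- swlist -l file | grep $bin_path"
        swlist -l file | grep "$bin_path" || true
    fi
    return 0
}
```

Calling describe_binary midaemon would print the path followed by one section per available tool. Note that command -v prints just the name for shell built-ins, so this is only meaningful for real executables.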

But there’s more. Let’s suppose that lsof is on hand (as it should be!); then we can do this:

# lsof -c midaemon
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
midaemon 2198 root  cwd    DIR 64,0x3     8192     2 /
midaemon 2198 root  txt    REG 64,0x5   828932 13799 /opt/perf/bin/midaemon
midaemon 2198 root  mem    REG 64,0x8    19799   956 /usr/lib/tztab
midaemon 2198 root  mem    REG 64,0x8    87900    78 /usr/lib/hpux32/libnss_dns.so.1
midaemon 2198 root  mem    REG 64,0x8   169104   722 /usr/lib/hpux32/libnss_files.so.1
midaemon 2198 root  mem    REG 64,0x8    76236 19454 /usr/lib/hpux32/libdl.so.1
midaemon 2198 root  mem    REG 64,0x8  4929272   695 /usr/lib/hpux32/libc.so.1
midaemon 2198 root  mem    REG 64,0x5   115124 13809 /opt/perf/lib/hpux32/libIO.so
midaemon 2198 root  mem    REG 64,0x8  1505144   734 /usr/lib/hpux32/libpthread.so.1
midaemon 2198 root  mem    REG 64,0x8  1065976 19453 /usr/lib/hpux32/dld.so
midaemon 2198 root  mem    REG 64,0x8   176988 19535 /usr/lib/hpux32/uld.so
midaemon 2198 root    2u   REG 64,0x9     1174 17923 /var (/dev/vg00/lvol9)
midaemon 2198 root    3u   REG 64,0x9     1174 17923 /var (/dev/vg00/lvol9)
midaemon 2198 root    4u   REG 64,0x9    11303 17949 /var (/dev/vg00/lvol9)
midaemon 2198 root    5u   REG 64,0x9    11303 17949 /var (/dev/vg00/lvol9)
midaemon 2198 root    7r   REG 64,0x9    13689  1620 /var/opt/perf/parm

This shows that the working directory is / (root); stdin and stdout are closed (there are no 0u or 1u entries in the FD column); stderr is still open and tied to a file on /var; and there are four other file descriptors open: three on files in /var and one on the /var/opt/perf/parm file (configuration). We can also deduce that another file descriptor (which would have been 6) was probably opened and has since been closed.
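Reading the FD column by eye works, but the same check can be scripted. The sketch below assumes lsof output in the column layout shown above (FD in the fourth column, header on the first line); the function name std_fd_state is invented for this example.

```shell
# std_fd_state: read lsof output on stdin and report whether descriptors
# 0, 1 and 2 appear in the FD column (hypothetical helper name).
std_fd_state() {
    awk 'NR > 1 {
             fd = $4
             sub(/[a-zA-Z]+$/, "", fd)      # strip the mode letter: 2u -> 2
             if (fd ~ /^[0-2]$/) seen[fd] = 1
         }
         END {
             for (i = 0; i <= 2; i++)
                 print i, ((i in seen) ? "open" : "closed")
         }'
}
# Usage: lsof -c midaemon | std_fd_state
```

For the midaemon listing above, this should report descriptors 0 and 1 as closed and 2 as open.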

There are also no open network connections, pipes, or the like.

The ps output provides more details:

# ps -elf | sed -n '1p; /midaem[.]*on/p;'
  F S      UID   PID  PPID  C PRI NI             ADDR   SZ            WCHAN    STIME TTY       TIME COMD
541 R     root  2198     1  0 -16 20 e00000060de31b80  524                -  Jan 15  ?        28:55 /opt/perf/bin/midaemon

From this we can see it is relatively small (SZ = 524). This example also shows a couple of tricks: the 1p in the sed command keeps the header line intact, and the pattern midaem[.]*on matches midaemon without matching the sed command’s own entry in the process list.
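The sed trick generalizes to any process name. Here is a rough sketch that builds the pattern automatically by slipping [.]* in after the first character; the helper name grep_keep_header is made up, and it assumes the name contains no slashes or regular expression metacharacters.

```shell
# grep_keep_header NAME: filter ps-style text on stdin, always keeping
# line 1 (the header). Hypothetical helper; NAME must be a plain word.
grep_keep_header() {
    name=$1
    rest=${name#?}              # everything after the first character
    first=${name%"$rest"}       # the first character
    # "[.]*" matches zero or more literal dots, so the pattern still
    # matches NAME but not the text of this sed command itself.
    sed -n "1p; /${first}[.]*${rest}/p"
}
# Usage: ps -elf | grep_keep_header midaemon
```

Reading from stdin keeps the helper usable with any ps variant (or with saved output).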

Using glance, we can find out even more. In the text mode glance command, first select the process (using the command key s and entering the pid – 2198). Glance then shows a view of the process’s current activity. In this case, we can see the total size is 51.6 MB (VSS) and the in-memory size is 44.8 MB (RSS). We can also see that the process appears to be switching voluntarily almost all of the time – that is, it rarely uses its full time slice when scheduled.

From that process summary display, enter the command key M. This provides a detailed memory display of the process – very useful. The various types of memory used by the process are broken down at the bottom in summary: text refers to the program code; data is program data; stack is a working area as well as where function calls are stored; shmem refers to shared memory (memory shared between processes); and other, which is everything else. All these areas are shown explicitly above in the main display.

Using the command key F, we can see again what lsof showed us. With an inode number, we can search for the file explicitly. Using lsof:

# lsof  | sed -n '1p;  / 17949 /p'
COMMAND     PID     USER   FD   TYPE             DEVICE    SIZE/OFF    NODE NAME
scopeux    2150     root    0u   REG             64,0x9       11303   17949 /var (/dev/vg00/lvol9)
scopeux    2150     root    1u   REG             64,0x9       11303   17949 /var (/dev/vg00/lvol9)
scopeux    2150     root    2u   REG             64,0x9       11303   17949 /var (/dev/vg00/lvol9)
scopeux    2150     root    4u   REG             64,0x9       11303   17949 /var (/dev/vg00/lvol9)
scopeux    2150     root    5u   REG             64,0x9       11303   17949 /var (/dev/vg00/lvol9)
midaemon   2198     root    4u   REG             64,0x9       11303   17949 /var (/dev/vg00/lvol9)
midaemon   2198     root    5u   REG             64,0x9       11303   17949 /var (/dev/vg00/lvol9)
# lsof  | sed -n '1p;  / 17923 /p'
COMMAND     PID     USER   FD   TYPE             DEVICE    SIZE/OFF    NODE NAME
midaemon   2198     root    2u   REG             64,0x9        1174   17923 /var (/dev/vg00/lvol9)
midaemon   2198     root    3u   REG             64,0x9        1174   17923 /var (/dev/vg00/lvol9)
#

It would appear that scopeux (another command) is sharing a file with midaemon (inode 17949) on /var, and that inode 17923 is not shared. Since there is no file listed, it is likely that these files were created, then deleted after opening. (The inode remains, but the file is not listed in the directory).
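The create-open-delete pattern is easy to reproduce. The following sketch assumes a shell with mktemp available (not a given on every UNIX); the function name is only for illustration.

```shell
# demo_unlinked_file: show that an open file outlives its directory entry.
demo_unlinked_file() {
    tmp=$(mktemp) || return 1
    echo "still readable" > "$tmp"
    {
        rm -f "$tmp"                        # remove the name; the inode lives on
        [ -e "$tmp" ] || echo "name is gone"
        cat <&3                             # data is still reachable via fd 3
    } 3< "$tmp"                             # fd 3 is opened before the group runs
}
demo_unlinked_file
```

While the descriptor is open, lsof would show the file much like the /var entries above: an inode with no path. Once the last descriptor is closed, the space is released.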

Another useful tool is tusc:

sybil # tusc 2198
( Attached to process 2198 ("/opt/perf/bin/midaemon") [32-bit] )
ki_call(KI_TRACE_GET, 0x40080ab0, 0x80000, 0x7ffff860) ............................................................... [sleeping]
In user-mode ......................................................................................................... [sleeping]
In user-mode ......................................................................................................... [sleeping]
In user-mode ......................................................................................................... [sleeping]
In user-mode ......................................................................................................... [sleeping]
ksleep(PTH_CONDVAR_OBJECT, 0x400108b0, 0x400108b8, NULL) ............................................................. [sleeping]
ki_call(KI_TRACE_GET, 0x40080ab0, 0x80000, 0x7ffff860) ............................................................... = 8
kwakeup(PTH_CONDVAR_OBJECT, 0x400108b0, WAKEUP_ONE, 0x7ffff7c0) ...................................................... = 0
ksleep(PTH_CONDVAR_OBJECT, 0x400108b0, 0x400108b8, NULL) ............................................................. = 0
ki_call(KI_TRACE_GET, 0x40080b50, 0x80000, 0x7ffff860) ............................................................... = 8
kwakeup(PTH_CONDVAR_OBJECT, 0x400108b0, WAKEUP_ONE, 0x7ffff7c0) ...................................................... = 0
ksleep(PTH_CONDVAR_OBJECT, 0x400108b0, 0x400108b8, NULL) ............................................................. = 0
ki_call(KI_TRACE_GET, 0x40080bf0, 0x80000, 0x7ffff860) ............................................................... = 8
kwakeup(PTH_CONDVAR_OBJECT, 0x400108b0, WAKEUP_ONE, 0x7ffff7c0) ...................................................... = 0
ksleep(PTH_CONDVAR_OBJECT, 0x400108b0, 0x400108b8, NULL) ............................................................. = 0
ki_call(KI_TRACE_GET, 0x40080c90, 0x80000, 0x7ffff860) ............................................................... = 8
ksleep(PTH_CONDVAR_OBJECT, 0x400108b0, 0x400108b8, NULL) ............................................................. = 0
kwakeup(PTH_CONDVAR_OBJECT, 0x400108b0, WAKEUP_ONE, 0x7ffff7c0) ...................................................... = 0
ki_call(KI_TRACE_GET, 0x40080ab0, 0x80000, 0x7ffff860) ............................................................... = 8
kwakeup(PTH_CONDVAR_OBJECT, 0x400108b0, WAKEUP_ONE, 0x7ffff7c0) ...................................................... = 0
ksleep(PTH_CONDVAR_OBJECT, 0x400108b0, 0x400108b8, NULL) ............................................................. = 0
( Detaching from process 2198 ("/opt/perf/bin/midaemon") )

The tusc command will show you what the process is doing, and what system calls it is making. If the process can be started from scratch (by restarting the program binary) then a lot of information can be gathered using tusc.

A summary view of this same data can be had from glance, using the L command key to show the system calls made and the time spent in each one. Just as tusc reported, in this case ki_call(), ksleep(), and kwakeup() are the three system calls being made.
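If glance is not at hand, a crude per-call count can be pulled from saved tusc output. This sketch assumes the line format shown above (call name, then an opening parenthesis); the function name count_syscalls is hypothetical, and unlike glance it gives counts only, not time spent.

```shell
# count_syscalls: tally completed system calls in tusc output on stdin.
# Lines ending in [sleeping] are skipped so a call that blocks and later
# returns is not counted twice.
count_syscalls() {
    awk -F'(' 'NF > 1 && $0 !~ /\[sleeping\]$/ && $1 ~ /^[a-z_][a-z0-9_]*$/ {
                   n[$1]++
               }
               END { for (name in n) print name, n[name] }' | sort
}
# Usage: tusc 2198 > trace.out   (interrupt after a while), then:
#        count_syscalls < trace.out
```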

Again using glance, if you want to see the wait states for the process (reasons the process gives up the CPU to other processes) use the W key command. For midaemon, it shows sleep as the reason for 85% of wait states in this process.

We can look through the binary for even more detail:

# strings `which midaemon` | head -n 7
/var/opt/perf/status.mi
/var/opt/perf/status.mi
/dev/ptym/
@$Header: miflock.c,v 1.2 95/09/27 08:43:20 thierry Exp $
@(#)midaemon       C.04.70.000  10/03/07 HP-UX 11 =*=
-pstat_freq
        4p
# tail -n 30 /var/opt/perf/status.mi
midaemon: Tue Oct 28 23:53:34 2008
Start midaemon       C.04.70.000  10/03/07 HP-UX 11 =*=
midaemon: Wed Oct 29 03:31:41 2008
Stop midaemon - non-permanent/no-client, normal MI termination
midaemon: Wed Oct 29 03:39:56 2008
Start midaemon       C.04.70.000  10/03/07 HP-UX 11 =*=
midaemon: Tue Nov 11 19:10:11 2008
Stop midaemon - non-permanent/no-client, normal MI termination
midaemon: Tue Nov 11 19:21:32 2008
Start midaemon       C.04.70.000  10/03/07 HP-UX 11 =*=
midaemon: Fri Nov 21 21:30:21 2008
Stop midaemon - non-permanent/no-client, normal MI termination
midaemon: Fri Nov 21 21:38:29 2008
Start midaemon       C.04.70.000  10/03/07 HP-UX 11 =*=
midaemon: Fri Nov 28 10:15:28 2008
Start midaemon       C.04.70.000  10/03/07 HP-UX 11 =*=
midaemon: Wed Dec 10 11:41:26 2008
Start midaemon       C.04.70.000  10/03/07 HP-UX 11 =*=
midaemon: Thu Jan 15 21:31:06 2009
Stop midaemon - Commanded MI termination
midaemon: Thu Jan 15 21:42:42 2009
Start midaemon       C.04.70.000  10/03/07 HP-UX 11 =*=
midaemon: Thu Jan 15 21:55:53 2009
Stop midaemon - Commanded MI termination
midaemon: Thu Jan 15 22:03:59 2009
Start midaemon       C.04.70.000  10/03/07 HP-UX 11 =*=
Categories: Debugging, HP-UX

Bug: synergyc freezes

19 January 2009 ddouthitt 1 comment

If you are using Synergy in your daily work, you may have noticed that the Linux client is not working as it should. A bug (Bug #194029) reported to the Ubuntu development team provides extensive reports about the problem and possible resolutions. The biggest problem is sifting through all of them, as well as the realization that they don’t seem to have it fixed yet.

Admittedly, my problems are with Fedora 9 and Windowmaker (which shows that it’s not Ubuntu-specific, nor is it specific to GNOME or KDE). However, the resolutions seem to work under Fedora just as well as under Ubuntu.

The resolutions recommended are:

  • Run synergy as root: sudo synergyc. This resolution seems to be the one least likely to work.
  • Run synergy with the highest priority possible using chrt: chrt -p 99 PID, where PID is the process id of synergyc. This method can be incorporated into a startup script like this: /usr/bin/synergyc myserver; pgrep synergyc | sudo xargs chrt -p 99
  • Recompiling the Linux kernel with a different scheduler: instead of configuring with CONFIG_FAIR_USER_SCHED use CONFIG_FAIR_CGROUP_SCHED.
  • Patching synergyc to fix the problem.
  • Enabling (for Ubuntu only) the hardy-proposed repository and updating the kernel to 2.6.24-16 or 2.6.24-17-generic seemed to work (although there were complaints that the desktop became sluggish).

In the case of Fedora 9 at least, this bug remains present even though it is almost a year old. I don’t use synergyc on an Ubuntu client – I use my Kubuntu host almost entirely from the console directly.

So what is the answer? I’d try using chrt first (for me that lessened the problem dramatically) and try upgrading to a new kernel configuration.

Solaris Virtual Memory Analysis: a tour through one admin’s process

28 November 2008 ddouthitt

This article by A. J. Clark was very informative; it doesn’t just show you what the problem is, but takes you through the process as the administrator analyzes a Solaris 8 server trying to find out why swap space was so heavily used. Go read it!

Categories: Debugging, Solaris

Getting a network interface to function

13 June 2008 ddouthitt

When bringing up a machine, and having to debug network connectivity, there is no substitute for being able to look at network traffic on the wire. Be aware that sniffing traffic can be fatal to your employment and perhaps your career if you do not follow the approved practices in your environment. If you do have the permission to perform network sniffing, it is an invaluable asset for debugging network problems.

One thing to be aware of, especially when not using UNIX or Linux, is that TCP/IP is an add-on protocol for other environments such as Windows and OpenVMS.

What can you determine from sniffing the network traffic?

  • Is the system sending out traffic at all?
  • What is the actual MAC address of the interface?
  • Are ARP requests going out?
  • Is DHCP being used? Is it failing or succeeding?
  • Is DNS being used? Is it failing or succeeding?
  • Is ping working? Are replies being received?

There are many other things that can be answered through looking at the network traffic. At its most basic (if network connectivity is the problem), the server can be disconnected and traffic looked at from the switch (with the normal cable) and from the server (using a cross-over cable).
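The post does not name a sniffer, so take the following as an assumption: with tcpdump, each of the questions above maps onto a capture filter. The sketch only prints the command instead of running it, since capturing needs root and, as noted, permission. The helper name sniff_cmd and the default interface eth0 are placeholders.

```shell
# sniff_cmd QUESTION [INTERFACE]: print a tcpdump command for one of the
# questions above. Hypothetical helper; eth0 is only a placeholder.
# -n avoids name lookups, -e prints the link-level (MAC) header.
sniff_cmd() {
    iface=${2:-eth0}
    case $1 in
        traffic|mac) filter="" ;;
        arp)         filter="arp" ;;
        dhcp)        filter="udp port 67 or udp port 68" ;;
        dns)         filter="port 53" ;;
        ping)        filter="icmp" ;;
        *) echo "unknown question: $1" >&2; return 1 ;;
    esac
    echo "tcpdump -n -e -i $iface${filter:+ $filter}"
}
# Usage: sniff_cmd dhcp lan0
```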

With this information, it may be possible to clear up many network connectivity problems.

System Reboots Require These Tools and Practices

28 March 2008 ddouthitt 4 comments

When a long-running server needs to be rebooted, what are the most important tools? Remember, on many systems the time between reboots can be weeks, months, or even years. So a reboot is not a normal occurrence for the machine.

So what would be the best tools to have on hand? Paper and pen. Take extensive notes of everything that happens out of the ordinary as the system comes up – things to fix, things to watch out for, and so on. Recording how much time it takes may not be a bad idea. Watch for services that are not required and shut them down as needed.

When debugging the reboot process, make sure to get evidence of a completely clean startup before considering the job done. The job may look like it is done, but if a reboot exposes a failure in configuration or other problems, then it’s not done – and you won’t know unless you reboot.

Also when you reboot, make sure that all subsystems are up and running. Often, important subsystems are not set to automatically start up – in case the system crashes, the idea is to keep the system off-line until the reason for its demise is fully known. So don’t forget these important subsystems and start them up after booting – whether the system is Caché or Oracle or some other.
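A short checklist script can back up the paper notes. This is a sketch only: the function name check_subsystems is invented, the process names you pass in are whatever matters on your system, and it assumes the process list arrives on stdin with one command name per line.

```shell
# check_subsystems NAME...: read process names (one per line) on stdin
# and report which of the expected ones are missing. Hypothetical helper.
check_subsystems() {
    procs=$(cat)
    rc=0
    for name in "$@"; do
        if printf '%s\n' "$procs" | grep -qxF -- "$name"; then
            echo "up:   $name"
        else
            echo "DOWN: $name"
            rc=1
        fi
    done
    return $rc
}
# Usage: ps -e -o comm= | check_subsystems sshd cron
```

The exact form of the command name reported by ps varies between systems (bare name versus full path), so check what your ps prints before relying on this.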

Generating a coredump (gcore)

16 January 2008 ddouthitt

If you wish to examine a runaway program outside of its element, you may choose to use the utility gcore. This utility is found in Solaris, Linux, and HP-UX, and perhaps others. The program syntax is:

gcore [ -o corename ] pid

The pid is the process id of the process to dump core, and the corename is the base of the filename to use for the core dump – the full name is the base name plus period (“.”) and the process id number. The default is to use “core”.

HP-UX systems will accept multiple process ids instead of just one. Solaris has several additional flags (as well as multiple pids). The additional Solaris flags won’t be covered here.

Once core has been dumped, the program continues operation; it does not stop. Thus, gcore is especially useful for taking a snapshot of a running process.

For example, consider a program with the process id 6674:

gcore 6674

This command generates a core file in the current directory with the name “core.6674”. This file then can be read by the GNU debugger gdb. Solaris also provides the dbx(1), mdb(1), and pstack(1) utilities. HP-UX provides gdb as well as the HP adb(1) utility. Both Solaris and HP-UX provide a core management utility coreadm(1m) – which is a topic for another day.
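The naming rule makes it easy to script a snapshot and hand the result to a debugger. A minimal sketch, using only the gcore syntax given above; the two function names are made up for this example.

```shell
# core_file_name PID [BASE]: the file gcore is expected to write,
# following the "base name, period, pid" rule described above.
core_file_name() {
    printf '%s.%s\n' "${2:-core}" "$1"
}

# snapshot_core PID [BASE]: dump core and print the resulting file name.
# Hypothetical wrapper; requires gcore and permission to inspect PID.
snapshot_core() {
    pid=$1
    base=${2:-core}
    gcore -o "$base" "$pid" >/dev/null || return 1
    core_file_name "$pid" "$base"
}
```

With gdb, the snapshot could then be opened with something like gdb /path/to/program "$(snapshot_core 6674)", where the program path is whatever binary the process is running.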

This article has an excellent description of working with core files in Solaris.

What to do when the system libraries go away…

9 January 2008 ddouthitt

You’ve been hacking away at this system (let’s be positive and upbeat and say it’s a test system and not production). Through a slip of the fingers, you move the system libraries out of the way – all of them. Now nothing can find the libraries. Now what? Is everything lost?

Don’t despair! You can do a lot without libraries. Already loaded software has the libraries in memory, so that is okay. This includes the shell, so the shell should be okay.

There may be some statically compiled binaries on the system that don’t require libraries; these can be run. If a scripting language like perl or ruby is statically compiled, then all is well – these languages can do anything, and can replace binaries (temporarily) such as mv, cp, and others. However, since vi is probably not statically linked, you may have to make any file changes at the command line (and not in an editor).

Here are some things one can do:

echo *

Through the use of the shell’s filename expansion, this works out to a reasonable imitation of ls (ls -m, in particular). If you have to empty a file (make the contents nothing), use this command:

> file

Nearly every standard utility today is dynamically linked; this means that in situations like these you may be stuck with only what the shell itself provides. Remember that things like cat, ls, mv, cp, vi, rm, ln, and so on are all system executables – and quite possibly dynamically linked.
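A few more stand-ins can be built from shell built-ins alone. The sketch below assumes that printf, read, and [ are built into your shell (true of bash and dash; check with type printf, as some older shells differ). The sh_ names are invented, and sh_cat and sh_cp are only safe for text files.

```shell
# Built-in-only stand-ins for cat, ls and cp (hypothetical names).
sh_cat() {
    # Print a text file line by line; also handles a last line that
    # lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done < "$1"
}

sh_ls() {
    # One name per line; like a plain ls, dot files are not shown.
    for f in "${1:-.}"/*; do
        if [ -e "$f" ]; then printf '%s\n' "${f##*/}"; fi
    done
}

sh_cp() {
    sh_cat "$1" > "$2"
}
```

With sh_cp and a known-good copy of a small text file (a configuration file, say), some repairs become possible without any external binary. Moving the libraries themselves back still needs a real mv or a statically linked tool.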

The best thing for a situation like this is to have prepared in advance – have a copy of busybox handy, and possibly a statically compiled perl or ruby (or both). Don’t forget editors – either have a copy of e3 or of a statically compiled editor. Busybox provides all the standard utilities in one statically linked binary, and e3 is a tiny (and i386-specific) editor which emulates vi, pico, wordstar, or emacs, depending on the name it is invoked under. Neither busybox nor e3 requires additional libraries.

A good tool (and a good tool in case of security breach) is a small CDROM of tools, all statically linked for your environment. Such a disk requires no libraries at all – and could have all of these necessary tools and more.

Of course, the best thing is to avoid doing this kind of thing in the first place…

SystemTap (and DTrace)

4 December 2007 ddouthitt 1 comment

SystemTap is one amazing piece of work – it is a programmer-friendly and admin-friendly interface to KProbes (which are included in the Linux 2.6 kernel). When you compare its capabilities to what has gone before, it is truly amazing. Here are some of the things you can do:

  • Quantify disk accesses per disk per process (or per user)
  • Quantify the number of context switches that are a result of time outs
  • List all accesses to a particular file and the process that accessed it

This is only the tip of the iceberg. There is a wiki with more details, including “war stories.”  There is a language reference there as well.

There was an excellent article in Red Hat Magazine, “Instrumenting the Linux Kernel with SystemTap” by William Cohen.

One controversy that came up was that the initial impetus for creating SystemTap was to implement something like Sun’s DTrace for Solaris but under the GNU General Public License. Solaris and DTrace are licensed with Sun’s Common Development and Distribution License (CDDL), which many feel makes DTrace incompatible with the GPL-licensed Linux kernel.

Apparently, the CDDL is also incompatible with the BSD-licensed FreeBSD, as FreeBSD 7.0 will not have DTrace either. There appear to be some licensing issues.

According to the Wikipedia entry on the CDDL, it was designed to be both GPL-incompatible and BSD-incompatible.  With regard to the GPL, the entry suggests that Sun never clarified why; as to the BSD, Sun did not want Solaris to wind up in proprietary products – which the BSD license allows.

On a brighter note, Eugene Teo was able to get the SystemTap tool to work on the Nokia N800. The article seems to be behind a wall at LiveJournal; the article is still in Google’s cache. However, it does require some amazing convolutions:

  • A kprobes-enabled kernel must be installed on the N800
  • The SystemTap programs (like stap) must be installed on the N800
  • Any traces must be cross-compiled on another host
  • The kernel module thus created must be moved to the N800
  • Once the kernel module is in place, then the trace can be done.

So every desired trace requires cross-compilation on a desktop beforehand (sigh)… Oh, well.

There is even a GUI for SystemTap in the works.

OpenSolaris on a MacBook

22 November 2007 ddouthitt 2 comments

OpenSolaris is very interesting, and since the introduction of dtrace and ZFS it has enthralled many. I tried to install it onto my HP Compaq E300 laptop (for which it was unsuitable), and then onto an HP Compaq 6910p laptop. In the latter case, the networking was unsupported: neither the ethernet nor the wireless driver was included with OpenSolaris Express (Developer Edition).

In any case, I expect I might just be shopping for a laptop in the next year – and it’s nice to see that OpenSolaris does run on the Apple MacBook.  This article goes into detail about how the writer got it to work, and each of the steps that were taken to make it happen.  Paul Mitchell from Sun discusses dual-partitioning a MacBook in this context as well.  Alan Perry (also from Sun) had done the same thing with a Mac Mini, and Paul extended it to the MacBook.  Both entries are detailed and have to do with MacOS X and Solaris dual-booting.

On a different note, check out the graph of library calls from dtrace in this article. From what I’ve heard of dtrace, it’s the ultimate when it comes to debugging…

5 reasons to want a core dump!

16 November 2007 ddouthitt

There are several reasons to want to make the kernel dump core – the central one being there is some kernel or hardware based problem which continues to occur. What happens during a kernel panic (when properly configured) is that the kernel itself “dumps core” and the core can be used after reboot for analysis.

So here are some reasons:

  • Intermittent kernel reboots
  • Hard drive “lockups” (constant access, system frozen)
  • Apparent hardware failures
  • Speed problems in the kernel
  • Kernel panic debugging

All except the last depend on a user (administrator) generated kernel panic with associated kernel dump. Of course, this is hard on filesystems, though Linux at least has the option of performing a “sync” from the same location as the user generated panic.

Most UNIX operating systems have the capability for the administrator to generate a kernel-based core dump. Linux users must have a kernel that supports the Magic SysRq key. Solaris on SPARC is set to go; Solaris on Intel processors requires booting the Solaris kernel with the kmdb kernel module loaded (through parameters and settings in the boot loader).
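On Linux, the sync-then-panic sequence can also be triggered without the keyboard by writing the same key letters to /proc/sysrq-trigger (s for an emergency sync, c to crash the kernel). The sketch below is deliberately a dry run unless told otherwise; the function name and the --really flag are made up for this example, and an actual dump is only captured if the system has been configured for it.

```shell
# sysrq_panic [--really]: emergency sync, then force a kernel crash.
# Without --really it only prints what it would do. Needs root and a
# kernel with Magic SysRq support. Hypothetical helper.
sysrq_panic() {
    for key in s c; do
        if [ "$1" = "--really" ]; then
            echo "$key" > /proc/sysrq-trigger
            sleep 2     # give the emergency sync a moment before the crash
        else
            echo "would run: echo $key > /proc/sysrq-trigger"
        fi
    done
}
sysrq_panic
```

Keeping the destructive path behind an explicit flag is a design choice for a command that takes the machine down on purpose.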

Applications will also generate core dumps, and a lot of the core dump analysis tools used for applications and the methods used can be useful in analyzing kernel dumps as well. BEA has an excellent (multi-platform) description of creating and analyzing core dumps – even though it is oriented towards their Tuxedo product, it seems still useful.

Sun has an excellent article by Adam Zhang, Core Dump Management on the Solaris OS, that covers both application core dumps and system kernel core dumps.

For HP-UX, there isn’t as much on crash dump analysis, though the whitepaper Debugging Core Files using HP WDB (PDF) may be useful.

I don’t know AIX (nor z/OS) that well myself, but there are some free RedBooks that include core dump analysis as part of the book. There is z/OS Diagnostic Data Collection and Analysis for z/OS (if you just happen to have a mainframe in house) and Problem Solving and Troubleshooting in AIX 5L for AIX.

Likely I’ll be covering some of these tools in depth.  For most versions of UNIX and Linux, there are man pages for core(5).  Some systems offer the commands gcore and savecore as well.  As always, the FreeBSD man pages web page covers HP-UX (HP-UX 11.22), Solaris (Solaris 9), and Red Hat (Red Hat Linux 9) and others as well as FreeBSD. Unfortunately, it appears that other Linux and UNIX versions are not being updated (for whatever reason – space?).