The Nagios Ecosystem: Nagios, Shinken, and Icinga

Nagios has been a standard-bearer for a long time; it was originally developed by Ethan Galstad and has been included in Debian and Ubuntu for quite some time. In 2007, Ethan founded Nagios Enterprises, a company built around providing enhancements to Nagios. However, for several years now there have been competitors to the original Nagios.

The first to come along was Icinga. This was a direct fork of the Nagios code that happened in May of 2009; the story of what led to the fork was admirably reported by Free Software Magazine in April of 2012. In short, many developers were unhappy with the way that Nagios was being developed and with what they perceived as its many shortcomings, which Ethan could not or would not fix. From Ethan’s standpoint, it was more about the enforcement of the Nagios trademark. The article summed it up best at the end: it’s complicated.

The H-Online also had an interview with Ethan Galstad about the future of Nagios and some of the history of the project.

Icinga is now in Ubuntu Universe and has been since Natty. It is also available for Debian Squeeze (current stable release).

Another project is Shinken: rather than a fork, it is a compatible replacement for the core Nagios code. When the Python-based Shinken code was rejected (vigorously) in the summer of 2010 as a possible Nagios 4, it became an independent project. This project is newer than Icinga, but shows serious promise. It, too, is now available in Ubuntu Universe and in Debian Wheezy (the current testing release).

It is unfortunate that such animosity seems to swirl about Nagios; however, Icinga and Shinken appear to be quite healthy projects that provide much-needed enhancements to Nagios users – and both are available in Ubuntu Precise Pangolin, the most recent Ubuntu LTS release.

I don’t know if Icinga or Shinken still work with Nagios mobile applications. If it’s just the URL, then the web server could rewrite the URL; if there is no compatible page for the mobile applications, then they can’t be used. However, I’d be surprised if there was no way to get the mobile apps working.

I’m going to try running Shinken and/or Nagios on an installation somewhere; we’ll see how it goes. I’ll report my experiences at a later date.

Using Nagios from an Android Phone

I got an Android phone in the last year, and started looking in earnest for a Nagios client for it. With a Nagios client, you can check the current status of your systems in Nagios right from your phone.

There are several available; the two most often mentioned are NagRoid and NagMonDroid. However, neither of these worked for me – though there are indeed others that did.

All of the clients use the same basic method to get data from Nagios: scrape the data from a web page. The biggest problem comes when that web page is not available – or is incorrect. Most of these applications request a URL, but are sometimes unclear about exactly which URL they want. Add to that the fact that Nagios changed its URL structure slightly between versions, and it gets even more complicated.

To discover what was happening, I used tcpdump to watch the accesses to the web server from the Nagios clients, as well as watching the Apache logs. By doing this, I was able to discern what URLs were being loaded.
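
Something along these lines does the job (the interface name is an assumption, and the access log path is the Debian default; tcpdump’s -A flag prints packet payloads, so the requested URLs show up in plain HTTP traffic):

tcpdump -i eth0 -A -s 0 'tcp port 80'
tail -f /var/log/apache2/access.log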

Here are some of the URL paths being looked for by the various clients:

  • /cgi-bin/tac.cgi
  • /cgi-bin/status.cgi
  • /cgi-bin/nagios3/statuswml.cgi?style=uprobs
  • /cgi-bin/nagios3/status.cgi?style=detail
  • /cgi-bin/nagios3/status.cgi?&servicestatustypes=29&serviceprops=262144

Further complicating matters in my case was the fact that any unrecognized URL was massaged (via mod_rewrite) into serving the main Nagios page via SSL.

However, by using mod_rewrite it was possible to rewrite the old /cgi-bin paths to a newer /cgi-bin/nagios3 path, and things started working.
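
A rule along these lines is all it takes (a sketch only; the exact pattern and flags depend on the virtual host setup):

# map the old-style CGI paths onto the Debian/Ubuntu nagios3 layout,
# preserving any query string
RewriteEngine On
RewriteRule ^/cgi-bin/((tac|status|statuswml)\.cgi)$ /cgi-bin/nagios3/$1 [PT,QSA]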

In the case of the statuswml.cgi file, Google Chrome wanted to download the resulting WML file instead of rendering it.

The main choices for Nagios clients on Android are these:

  • aNag
  • NagRoid
  • NagMonDroid
  • jNag

I have gone with aNag – it has a nice interface, makes good use of notifications, and worked without trouble once the URL was fixed up. Several of the others never did work right – or gave no indication that they were working at all. jNag, for its part, also requires a modified Nagios server and the installation of mkLivestatus. aNag was the one that was easiest to work with and get working.

aNag does use a mostly text-based format to show data, but it has the ability to manipulate services, as well as one-button access to the web interface itself.

What’s Wrong with Nagios? (and Monitoring)

Nagios (or its new offspring, Icinga) is the king of open source monitoring, and there are others like it. So what’s wrong with monitoring? Why does it bug me so?

Nagios is not the complete monitoring solution that many think it is, because it can only mark the passing of a threshold: there are basically only two states, good and not good (ignoring “warning” and “unknown” for now).

What monitoring needs is two things: a) time, and b) flexibility.

Time is the ability to look at the change in a process or value over time. Disk I/O might or might not be high – but has it been high for the last twenty minutes, or is it just a momentary peak? Has disk usage been slowly increasing, or did it skyrocket in the last minute? This capability can be provided by tools like the Simple Event Correlator (SEC). The biggest problem with SEC is that it runs by scanning logs; if something isn’t logged, SEC won’t see it.
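
As a sketch of the idea (the log message pattern and the alert script are invented for illustration), a SEC SingleWithThreshold rule fires only when a condition keeps recurring within a time window:

# fire only if the message appears 20 times within 20 minutes (1200 s)
type=SingleWithThreshold
ptype=RegExp
pattern=high disk I/O on (\S+)
desc=Sustained high disk I/O on $1
action=shellcmd /usr/local/bin/alert.sh "$1: high I/O sustained for 20 minutes"
window=1200
thresh=20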

The second thing is what drove me to write: there is no flexibility in these good/not-good checks that Nagios and its ilk provide. There is also not enough flexibility in SEC and others like it.

What is needed is a pattern recognition system – one that says, this load is not like the others that the system has experienced at this time in the past. If you look at a chart of system load on an average server (with users or developers on it) you’ll see that the load rises in the morning and decreases at closing time. When using Nagios, the load is either good or bad – with a single value. Yet a moderately heavy load could be a danger sign at 3 a.m. but not at 11 a.m. Likewise, having 30 users logged in is not a problem at 3 p.m. on a Tuesday – but could be a big problem at 3 p.m. on a Sunday.

What we really need is a learning system that can match the current information from the system with recorded information from the past – matched by time.

It’s always been said that open source is driven by someone “scratching an itch.” This sounds like mine…

Nagios and the Ampersand Character in URLs

When configuring Nagios for notifications, you will have something like this:

# commands.cfg

# vim: se nowrap

#========================================
# NOTIFICATIONS
#========================================

# 'notify-host-by-email' command definition
define command{
        command_name    notify-host-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
        }

# 'notify-service-by-email' command definition
define command{
        command_name    notify-service-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
        }

Now, you might want to extend these by adding the host or service URL. With this addition, the receiver of the message can click on the URL for more data.

The URLs for the notes and actions are available; they are in these macros:

  • $HOSTACTIONURL$
  • $SERVICEACTIONURL$
  • $HOSTNOTESURL$
  • $SERVICENOTESURL$

It should be simple to add these to the command_line entry in the Nagios configuration; unfortunately, it is not that simple. This is because ampersands are removed from the URL when the macro is expanded. Ampersands are used to separate arguments that are sent to web pages – and so may well be included in the URLs. PNP4Nagios, for example, sends arguments to its web pages separated by ampersands (as it should).

The Nagios documentation hints at this character cleansing process, but does not mention these macros as ones that are cleaned and stripped of shell characters.

The shell characters that are stripped can be set with the illegal_macro_output_chars setting in the main configuration file. According to the documentation, the affected macros are: $HOSTOUTPUT$, $HOSTPERFDATA$, $HOSTACKAUTHOR$, $HOSTACKCOMMENT$, $SERVICEOUTPUT$, $SERVICEPERFDATA$, $SERVICEACKAUTHOR$, and $SERVICEACKCOMMENT$.
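
The stock nagios.cfg sets it to something like this (note that the ampersand is part of the set):

illegal_macro_output_chars=`~$&|'"<>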

There are ways to escape some characters in certain situations, but escaping the ampersand doesn’t work here – especially since the URLs are taken literally (and uninterpreted) in other situations.

The resolution to this problem is to bypass the macro altogether and use the environment variables instead. Replace the macro with its environment variable equivalent. For example, replace:

$HOSTACTIONURL$

in the command_line entry with this:

$${NAGIOS_HOSTACTIONURL}

With this, the string that is actually passed to the shell is:

${NAGIOS_HOSTACTIONURL}

which then is interpreted as a shell environment variable (which, in this case, is set by Nagios). The curly braces are possibly unnecessary, but they don’t hurt.
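
Putting it all together, the notify-host-by-email command from earlier might carry the URL like this (a sketch: only the URL portion is new, and it assumes environment macros are available to the command, per the enable_environment_macros setting in nagios.cfg):

define command{
        command_name    notify-host-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\nURL: $${NAGIOS_HOSTACTIONURL}\n\nDate/Time: $LONGDATETIME$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
        }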

Using m4 with Nagios: Advanced Ideas

Nagios configuration has traditionally been cumbersome and extensive; there are a lot of things to configure. The addition of templating some time ago helped, but not entirely. A configuration element such as a server or a switch can take up a huge amount of configuration and be quite repetitive, too.

Using m4 can alleviate all of these problems. When combined with GNU Make and Nagios configuration directories, changing the configuration can be done quite simply and easily.

With this beginning of a Makefile in /etc/nagios, each *.cfg file will be regenerated from its matching *.m4 file whenever the *.m4 prerequisite is newer:

M4=/usr/bin/m4
 
# pattern rule: rebuild a .cfg from its .m4 source, searching
# conf.d/includes for include files
%.cfg : %.m4
        $(M4) -I conf.d/includes < $< > $@

With this default rule in place, all configuration files in the conf.d directory can be converted with this Makefile syntax:

FILES=$(wildcard conf.d/*.cfg)
 
all: $(FILES)

This uses the GNU Make wildcard function to generate a list of files easily. Other directories can be added with new calls to the wildcard function; it is not recursive and won’t descend into subdirectories.

Finish off the Makefile with these:


restart: all
        service nagios3 restart
 
.PHONY: all restart

This makes it possible to put the m4 files into conf.d (each with a matching cfg file already present, or the wildcard won’t pick it up!) and use a library of predefined macro files included in conf.d/includes. If make is called with make restart, then all out-of-date configuration files will be processed by m4 as needed and Nagios will be reloaded.
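
A typical edit cycle then looks something like this (assuming the Makefile lives in /etc/nagios; the include file name is hypothetical):

cd /etc/nagios
vi conf.d/includes/common.m4   # change a macro shared by many hosts
make restart                   # regenerate stale .cfg files and reload Nagios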

M4 should be included in the base Linux install – but often isn’t. Install it and use it today!

Unit Testing and System Administration

Unit tests are a programmer’s best friend – they help the programmer to fix bugs and keep them fixed, by continually testing to make sure the bug stays fixed.

In administering a system, certain services must be available, and certain products must be installed and configured. Install processes like HP-UX Ignite or Solaris Jumpstart can help, as can products like cfengine.

However, a unit test environment can be of great use to make sure that all went according to plan. Nagios is the best known of these – yes, a unit tester. Consider: do you need NFS? Test for an NFS volume. Do you require a database to be up? Check via TCP that the server’s connection is available.
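
For instance, a pair of service checks along these lines covers both cases (the host names are invented, and check_tcp is assumed to be defined to take the port as its argument, as in the Debian plugin configuration):

# NFS over TCP normally listens on port 2049
define service{
        use                     generic-service
        host_name               fileserver
        service_description     NFS
        check_command           check_tcp!2049
        }

# a plain TCP connect verifies the database is accepting connections
define service{
        use                     generic-service
        host_name               dbserver
        service_description     DB-TCP
        check_command           check_tcp!3306
        }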

In addition, if you’ve bounced a server only to find that you forgot something: create a unit test for it (that is, create a check in Nagios). If an active check won’t work, use a passive check: a check that runs on the server and reports back to Nagios.
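
On the Nagios host itself, a passive result can be submitted by writing to the external command file (the path shown is the Debian nagios3 default, and the host and service names are invented; remote hosts usually relay results through something like NSCA instead):

NOW=$(date +%s)
printf "[%s] PROCESS_SERVICE_CHECK_RESULT;myhost;Backup;0;backup completed OK\n" "$NOW" \
  > /var/lib/nagios3/rw/nagios.cmd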

If you continue to add checks as you think of them or encounter trouble, eventually you will find that you are much more in tune with your servers. Don’t forget to add NagiosGrapher to get the benefit of performance history too. With both Nagios and NagiosGrapher, you’ll be all set.


Nagios Tips: Did You Know… ?

There are a number of things Nagios can do that I did not know about until I had used it for some time.  I thought I would pass these facts on to you.  Once you know them, they seem simple – but only afterwards.

For example, consider the Host and Service Status Totals at the top of the screen.

All text (except the title) is clickable.  If you click on “All Problems” it will show the appropriate problem entries (assuming they can be seen in the current view!).

Another example is the Service Overview: if you click on the extended title for a service group, you’ll see all details for that service group.  However, if you click on the short title for a service group, you’ll be able to take actions on the entire service group as a whole (very nice!).  You can schedule downtime, enable or disable notifications, and enable or disable active checks.

This capability extends to the Host Groups as well: you can (at the appropriate screen) enable downtime for a hostgroup, enable or disable notifications for a hostgroup or for all services in a hostgroup, and enable or disable active checks for all services in a hostgroup.

Don’t forget to look at the innocuous-looking info box at the top left of the main Nagios data window; this box often provides ways to look at details of the current view.  For example, when looking at the Service Details for a particular host group, you can switch to a number of other views relating to the current host group, or to all host groups.

There is also the ability to sort the Status Details report.  This allows you to answer questions like these:

  • What is the most recent check completed?  (order by “Last Check”)
  • What is the longest status duration? (order by “Duration”)

Any column except “Status Information” can be sorted – click on the arrows in the column title.  Normally this report is sorted alphabetically by Host, then by Service.

However, suppose you want only one particular service group?  Click on the Service name, then, under “Member of” on the next screen, click on the group name.  This brings you to the Service Overview for that service group.  From there you can see the Service Details (by clicking the full title) or Actions (by clicking on the short title).

With all of these ways to view problems, you can answer your questions and view the results that much faster.


About NagiosGrapher (v1.7.1)

NagiosGrapher is a tool that fits in with Nagios and adds graphing capabilities. Anything that is reported by a Nagios check can be graphed. The software is available from NagiosForge.

NagiosGrapher uses RRDTool to maintain its data, so space is not typically an issue – old data that falls off the graph is not kept any longer. For example, details reported every 15 minutes don’t have to be kept beyond about 48 hours (two days).

Graphs are available for daily, weekly, monthly, and yearly reports.

If you install NagiosGrapher under a Red Hat system, you may have to tell it to use layout “red_hat” (as the automatic tool may discover “redhat” instead of “red_hat”). Once installed, there are a number of files and directories created.

/etc/init.d/nagios_grapher

This is the startup script for NagiosGrapher. Make sure that the location of the daemon is set correctly; I had to change it to read this way:

DAEMON=/usr/lib/nagios/plugins/contrib/collect2.pl

The script as delivered does not support Red Hat’s chkconfig utility; add these lines to make it support chkconfig (just under the top line):

#
# chkconfig: 345 99 01
# description: NagiosGrapher

Then these commands need to be run to activate NagiosGrapher:

# chkconfig --add nagios_grapher
# chkconfig nagios_grapher on

The next time the system starts, NagiosGrapher should start as well.

/etc/nagios/ngraph.ncfg

This is the general configuration file for NagiosGrapher itself. Perhaps the main thing to edit in this file is the option log_level (a bit-mapped value). This may be set high initially. I have it set to 63; it could probably be reduced further.

/var/log/nagios/ngraph.log

This is where the NagiosGrapher log is stored. The log details the operations of NagiosGrapher and the amount of detail is directly related to the log_level value in /etc/nagios/ngraph.ncfg.

/var/log/nagios/service-perfdata*

These files are the performance data stored by Nagios. These are actually Nagios files (not NagiosGrapher files) but are read by NagiosGrapher to create its information. If these start building up, then NagiosGrapher is not reading them appropriately; it may be worthwhile to restart the service using the initialization script.

/etc/nagios/ngraph.d

This directory contains the graph configuration. This configuration data is used to create the actual graphs. Many of these items will directly correlate with RRDTool commands and their options.

The defaults are set in ngraph.ncfg, and the specific graphs can be found under the templates directory. To disable a graph configuration, just add _disabled to the end of the file name. Presumably this works because the file no longer ends in .ncfg (the standard NagiosGrapher configuration file ending). Likewise, to enable a disabled configuration, remove the _disabled from the end of it (and make sure it ends in .ncfg).
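
For example (the template file name is hypothetical):

cd /etc/nagios/ngraph.d
# disable a graph template...
mv templates/check_ping.ncfg templates/check_ping.ncfg_disabled
# ...and re-enable it later
mv templates/check_ping.ncfg_disabled templates/check_ping.ncfg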

/etc/nagios/serviceext

This is a directory where NagiosGrapher stores the appropriate configuration for each host, generated from the performance data that it handles. These files are created automatically; to get the configuration loaded into Nagios, Nagios must be reloaded.

/usr/lib/nagios/plugins/contrib/

NagiosGrapher adds several files to this directory: udpecho, fifo_write.pl, fifo_write, collect2.pl, and the directory perl (which contains library files for NagiosGrapher). The most notable of these is collect2.pl: this is the actual program that runs and which scans the performance data for appropriate details.

/usr/lib/nagios/cgi/

NagiosGrapher adds several files to this directory as well: graphs.cgi, rrd2-graph.cgi, and rrd2-system.cgi.

I found that rrd2-system.cgi would not work: it would not present any PNG graphs. Adding this “fix” took care of the problem:

--- rrd2-system.cgi     2009-02-04 15:34:39.000000000 -0600
+++ /usr/lib/nagios/cgi/rrd2-system.cgi 2009-02-04 18:47:23.000000000 -0600
@@ -97,6 +97,7 @@
        $image_bin =~ s/^(\[.*?\])//;
 }

+if (1 == 0) {
 if ($image_format eq 'PNG' && $code == 0 && !$only_graph && !$no_legend) {
        $ng->time_need(type => 'start');
        # Adding brand
@@ -128,6 +129,7 @@
                $image_bin = $blobs[0];
        $ng->time_need(type => 'stop', msg => 'Adding PerlMagick');
 }
+}

 # no buffered operations
 STDOUT->autoflush(1);

It is rather a brute-force fix, but I didn’t want to haggle with it. The fix just disables the PNG branding step (the PerlMagick code visible in the patch), which isn’t needed for working operation of NagiosGrapher. It would have been nice to actually fix the problem, though.

/var/lib/nagios/rrd/

This is where the RRDTool files are kept. All RRD files are kept in a directory by hostname, and each has a hexadecimal name with the extension .rrd. The seemingly random names are correlated to services by the file index.ngraph, also in this directory.


Configuring Nagios with m4

When using m4 to configure Nagios, great advantages can be realized.  One of the easiest places to gain an advantage is when defining a new host.

Typically, a new host has not only a host definition but also a number of fairly standardized services – such as ping, FTP, telnet, SSH, and so forth.  Thus, when defining a new host configuration, you not only have to add a new host, but all of the relevant services as well – and perhaps host extra info and service extra info besides.

#----------------------------------------
# HOST: marco
#----------------------------------------
define host{
        use                     hpux-host               ; Name of host template
        host_name               marco
        address                 192.168.4.1
        }
define hostextinfo{
        host_name               marco
        action_url              http://marco-mp/
}
define service{
        use                             passive-service          ; Name of service
        host_name                       marco
        service_description             System Load
        servicegroups                   Load
        }
define service{
        use                             hpux-service          ; Name of service
        host_name                       marco
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
        }
define service{
        use                             hpux-service          ; Name of service
        host_name                       marco
        service_description             TELNET
        servicegroups                   TELNET
        check_command                   check_telnet
        }
define serviceextinfo{
        host_name                       marco
        service_description             TELNET
        action_url                      telnet://marco
}
define service{
        use                             hpux-service          ; Name of service
        host_name                       marco
        service_description             FTP
        servicegroups                   FTP
        check_command                   check_ftp
        }
define service{
        use                             hpux-service          ; Name of service
        host_name                       marco
        service_description             NTP
        servicegroups                   NTP
        check_command                   check_ntp
        }
define service{
        use                             hpux-service          ; Name of service
        host_name                       marco
        service_description             SSH
        servicegroups                   SSH
        check_command                   check_ssh
        }

Compare that output with the m4 code that generated it:

DEFHPUX(`marco',`192.168.4.1')

Another benefit is that if DEFHPUX is coded correctly – with each service in its own independent m4 macro, such as DOSSH for SSH – then a single change to the m4 file, propagated to the Nagios config file, can alter a service for every HP-UX host (in this example).

Here is a possible definition of DEFHPUX:

define(`DEFHPUX',`
#----------------------------------------
# HOST: $1
#----------------------------------------
define host{
        use                     hpux-host               ; Name of host template
        host_name               $1
        address                 $2
        }
define hostextinfo{
        host_name               $1
        action_url              http://$1-mp/
}
DOLOAD(`$1')
DOPING(`$1')
DOTELNET(`$1')
DOFTP(`$1')
DONTP(`$1')
DOSSH(`$1')')

Note that the closing quote and parenthesis come only after the final DOSSH call, so that the DOxxx service macros are expanded as part of DEFHPUX.
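
The DOxxx macros are just smaller templates of the same kind. A possible DOSSH, matching the SSH service shown in the generated output above, would be:

define(`DOSSH',`
define service{
        use                             hpux-service          ; Name of service
        host_name                       $1
        service_description             SSH
        servicegroups                   SSH
        check_command                   check_ssh
        }')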

There is a lot more that m4 can do; this is just the tip of the iceberg.


Nagios and m4

The macro processor m4 is perhaps one of the most underappreciated programs in a typical UNIX environment. Sendmail may be the only reason it still exists – or would that distinction include GNU autotools?

Configuring Nagios can be dramatically simplified and made easier using m4 with Nagios templates. Even when using Nagios service and host templates, there is a lot of repetition in the typical services file – perhaps an extreme amount of duplicated entries.

Using m4 can reduce the amount of work that is required to enter items.

Here is an example:

DEFHPUX(`red',`10.1.1.1')
DEFHPUX(`green',`10.1.1.2')
DEFHPUX(`blue',`10.1.1.3')
DEFHPUX(`white',`10.1.1.4')
DEFHPUX(`black',`10.1.1.5')
DEFHPUX(`orange',`10.1.1.6')

In my configuration, each line above expands into 64 lines (including three lines of header in the comments). So the result of those six lines is 384 lines of output.

Every DEFHPUX creates a host, complete with standard service checks such as PING, SSH, and TELNET. This is done all with just a few macro definitions at the beginning of the file.

Read about m4 and understand it, and your Nagios configurations will be much easier to manage. You can use the program make to automate the creation of the actual config files, as well as the check and reload necessary for Nagios to incorporate the changes.