The Nagios Ecosystem: Nagios, Shinken, and Icinga

Nagios has long been a standard-bearer. It was originally developed by Ethan Galstad and has been included in Debian and Ubuntu for quite some time. In 2007, Ethan founded Nagios Enterprises, a company built around providing enhancements to Nagios. For several years now, however, the original Nagios has had competitors.

The first to come along was Icinga. This was a direct fork of the Nagios code that happened in May of 2009; the story of what led to the fork was admirably reported by Free Software Magazine in April of 2012. In short, many developers were unhappy with the way that Nagios was being developed and with what they perceived as its many shortcomings, which Ethan could not or would not fix. From Ethan’s standpoint, it was more about the enforcement of the Nagios trademark. The article summed it up best at the end: it’s complicated.

The H-Online also had an interview with Ethan Galstad about the future of Nagios and some of the history of the project.

Icinga is now in Ubuntu Universe and has been since Natty. It is also available for Debian Squeeze (current stable release).

Another project is Shinken: rather than a fork, it is a compatible replacement for the core Nagios code. When the Python-based Shinken code was rejected (vigorously) in the summer of 2010 as a possible Nagios 4, it became an independent project. This project is newer than Icinga, but shows serious promise. It, too, is now available in Ubuntu Universe and in Debian Wheezy (current testing release).

It is unfortunate that such animosity seems to swirl about Nagios; however, Icinga and Shinken appear to be quite healthy projects that provide much-needed enhancements to Nagios users – and both are available in Ubuntu Precise Pangolin, the most recent Ubuntu LTS release.

I don’t know whether Icinga or Shinken still work with Nagios mobile applications. If the only difference is the URL, then the web server could rewrite it; if there is no compatible page for the mobile applications, then they can’t be used. However, I’d be surprised if there were no way to get the mobile apps working.

I’m going to try running Shinken and/or Nagios on an installation somewhere; we’ll see how it goes. I’ll report my experiences at a later date.

What’s Wrong with Nagios? (and Monitoring)

Nagios (or its new offspring, Icinga) is the king of open source monitoring, and there are others like it. So what’s wrong with monitoring? Why does it bug me so?

Nagios is not the complete monitoring solution that many think it is, because it can only mark the crossing of a threshold. There are basically only two states: good and not good (ignoring “warning” and “unknown” for now).
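To make the limitation concrete, here is a minimal sketch of what such a check boils down to. The exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) follow the Nagios plugin convention; the function name and the thresholds are made up for illustration and are not taken from any real plugin.

```python
# Exit codes used by Nagios plugins to report state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_threshold(value, warn, crit):
    """Classify one reading against fixed thresholds (hypothetical helper)."""
    if value is None:
        return UNKNOWN      # no reading available
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


# The same load value gets the same answer at 3 a.m. and at 11 a.m.;
# the check has no notion of time or history.
state = check_threshold(5.0, warn=4.0, crit=8.0)
```

Everything the check knows is the one value it was just handed, which is exactly the problem.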

What monitoring needs is two things: a) time, and b) flexibility.

Time is the ability to look at the change in a process or value over time. Disk I/O might or might not be high – but has it been high for the last twenty minutes or is it just on a peak? Has disk usage been slowly increasing or did it skyrocket in the last minute? This capability can be provided by tools like the Simple Event Correlator (SEC). The biggest problem with SEC is that it runs by scanning logs; if something isn’t logged, SEC won’t see it.
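The “peak or sustained?” question can be sketched in a few lines of Python. This is only an illustration of the idea, not how SEC or Nagios work; the class name, the threshold, and the window size are all assumptions of mine.

```python
from collections import deque


class SustainedCheck:
    """Keep the last `window` samples and tell a spike from a sustained high.

    Hypothetical helper: names and semantics are illustrative only.
    """

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # old samples fall off the front

    def add(self, value):
        self.samples.append(value)

    def is_sustained(self):
        # High for the whole window, e.g. twenty one-minute samples.
        full = len(self.samples) == self.samples.maxlen
        return full and all(v > self.threshold for v in self.samples)

    def is_spike(self):
        # Latest sample is high, but the window as a whole is not.
        if not self.samples:
            return False
        return self.samples[-1] > self.threshold and not self.is_sustained()
```

With one sample per minute and `window=20`, `is_sustained()` answers “has it been high for the last twenty minutes?” while `is_spike()` flags a single peak.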

The second thing is what drove me to write: there is no flexibility in these good/not-good checks that Nagios and its ilk provide. There is also not enough flexibility in SEC and others like it.

What is needed is a pattern recognition system – one that says, this load is not like the others that the system has experienced at this time in the past. If you look at a chart of system load on an average server (with users or developers on it) you’ll see that the load rises in the morning and decreases at closing time. When using Nagios, the load is either good or bad – with a single value. Yet a moderately heavy load could be a danger sign at 3 a.m. but not at 11 a.m. Likewise, having 30 users logged in is not a problem at 3 p.m. on a Tuesday – but could be a big problem at 3 p.m. on a Sunday.

What we really need is a learning system that can match the current information from the system with recorded information from the past – matched by time.
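As a rough sketch of what “matched by time” could mean, the Python below keeps a history of readings per (weekday, hour) slot and flags a new reading that is far from what that slot has seen before. Everything here is an assumption for illustration: the class name, the slot granularity, and the use of mean and standard deviation as the “pattern” are my own simplifications, and a real learning system would need something smarter.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


class TimeBaseline:
    """Compare a reading with past readings from the same weekday and hour."""

    def __init__(self, tolerance=3.0, min_samples=3):
        self.history = defaultdict(list)
        self.tolerance = tolerance               # allowed deviations from the mean
        self.min_samples = max(min_samples, 2)   # stdev needs at least two points

    @staticmethod
    def _slot(when):
        return (when.weekday(), when.hour)       # Monday is 0, Sunday is 6

    def record(self, when, value):
        self.history[self._slot(when)].append(value)

    def is_unusual(self, when, value):
        past = self.history.get(self._slot(when), [])
        if len(past) < self.min_samples:
            return False                         # not enough history to judge
        centre = mean(past)
        spread = stdev(past) or 1e-9             # avoid a zero-width band
        return abs(value - centre) > self.tolerance * spread


baseline = TimeBaseline()
# Logged-in users at 3 p.m. on three Tuesdays and three Sundays (made-up data).
for day, users in [(1, 28), (8, 30), (15, 32)]:
    baseline.record(datetime(2012, 5, day, 15), users)
for when, users in [(datetime(2012, 4, 29, 15), 1),
                    (datetime(2012, 5, 6, 15), 2),
                    (datetime(2012, 5, 13, 15), 3)]:
    baseline.record(when, users)

tuesday_alarm = baseline.is_unusual(datetime(2012, 5, 22, 15), 30)
sunday_alarm = baseline.is_unusual(datetime(2012, 5, 20, 15), 30)
```

With this made-up history, 30 users at 3 p.m. on a Tuesday is unremarkable, while the same 30 users at 3 p.m. on a Sunday is flagged.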

It’s always been said that open source is driven by someone “scratching an itch.” This sounds like mine…