What’s Wrong with Nagios? (and Monitoring)

Nagios (or its newer offspring, Icinga) is the king of open source monitoring, and there are others like it. So what’s wrong with monitoring? Why does it bug me so?

Nagios is not the complete monitoring solution that many think it is, because it can only mark the passing of a threshold. There are basically only two states: good and not good (ignoring “warning” and “unknown” for now).

What monitoring needs is two things: a) time, and b) flexibility.

Time is the ability to look at the change in a process or value over time. Disk I/O might or might not be high – but has it been high for the last twenty minutes, or is it just at a momentary peak? Has disk usage been slowly increasing, or did it skyrocket in the last minute? This capability can be provided by tools like the Simple Event Correlator (SEC). The biggest problem with SEC is that it works by scanning logs; if something isn’t logged, SEC won’t see it.
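To make the “peak or sustained?” question concrete, here is a minimal Python sketch. It is only an illustration: the function name classify_samples, the threshold, and the sample data are all made up, and it assumes you already collect timestamped readings (sorted oldest first) from somewhere.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, measured value)


def classify_samples(samples: List[Sample], threshold: float,
                     window_seconds: float) -> str:
    """Classify the latest reading as 'ok', 'spike', or 'sustained'.

    'sustained': every reading in the window is above the threshold and the
                 history actually reaches back a full window.
    'spike':     the latest reading is above the threshold, but the window
                 holds lower readings (or too little history to judge).
    Assumes samples are sorted by timestamp, oldest first.
    """
    if not samples:
        return "ok"
    latest_time = samples[-1][0]
    recent = [value for t, value in samples
              if latest_time - t <= window_seconds]
    if recent[-1] <= threshold:
        return "ok"
    covers_window = samples[0][0] <= latest_time - window_seconds
    if covers_window and all(value > threshold for value in recent):
        return "sustained"
    return "spike"


# Hypothetical data: one reading per minute for twenty minutes.
quiet_then_peak = [(m * 60.0, 10.0) for m in range(20)] + [(1200.0, 95.0)]
busy_all_along = [(m * 60.0, 95.0) for m in range(21)]

print(classify_samples(quiet_then_peak, 80.0, 1200.0))  # spike
print(classify_samples(busy_all_along, 80.0, 1200.0))   # sustained
```

A plain threshold check would report both of these cases identically; the window is what tells them apart.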

The second thing is what drove me to write: there is no flexibility in these good/not-good checks that Nagios and its ilk provide. There is also not enough flexibility in SEC and others like it.

What is needed is a pattern recognition system – one that says, this load is not like the others that the system has experienced at this time in the past. If you look at a chart of system load on an average server (with users or developers on it) you’ll see that the load rises in the morning and decreases at closing time. When using Nagios, the load is either good or bad – with a single value. Yet a moderately heavy load could be a danger sign at 3 a.m. but not at 11 a.m. Likewise, having 30 users logged in is not a problem at 3 p.m. on a Tuesday – but could be a big problem at 3 p.m. on a Sunday.

What we really need is a learning system that can match the current information from the system with recorded information from the past – matched by time.
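As a very rough sketch of what “matched by time” could look like, the Python below groups past readings by weekday and hour and flags a new reading that sits far from what that slot has seen before. This is nowhere near a real learning system, and everything in it (the class name HourOfWeekBaseline, the tolerance, the minimum spread, the login counts) is an assumption chosen for illustration.

```python
import statistics
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple


class HourOfWeekBaseline:
    """Remembers past readings per (weekday, hour) slot and flags outliers."""

    def __init__(self, tolerance: float = 3.0, min_samples: int = 4,
                 min_spread: float = 1.0) -> None:
        self.tolerance = tolerance      # allowed deviation, in "spreads"
        self.min_samples = min_samples  # history needed before judging
        self.min_spread = min_spread    # floor so flat history isn't too strict
        self.history: Dict[Tuple[int, int], List[float]] = defaultdict(list)

    @staticmethod
    def _slot(when: datetime) -> Tuple[int, int]:
        return (when.weekday(), when.hour)

    def record(self, when: datetime, value: float) -> None:
        self.history[self._slot(when)].append(value)

    def is_unusual(self, when: datetime, value: float) -> bool:
        past = self.history.get(self._slot(when), [])
        if len(past) < self.min_samples:
            return False  # not enough history for this slot yet
        centre = statistics.mean(past)
        spread = max(statistics.pstdev(past), self.min_spread)
        return abs(value - centre) > self.tolerance * spread


baseline = HourOfWeekBaseline()
# Hypothetical logged-in user counts at 3 p.m. on four Tuesdays and Sundays.
for day, users in zip((2, 9, 16, 23), (28, 30, 32, 30)):
    baseline.record(datetime(2024, 1, day, 15), users)   # Tuesdays
for day, users in zip((7, 14, 21, 28), (1, 2, 3, 2)):
    baseline.record(datetime(2024, 1, day, 15), users)   # Sundays

print(baseline.is_unusual(datetime(2024, 1, 30, 15), 30))  # Tuesday: False
print(baseline.is_unusual(datetime(2024, 2, 4, 15), 30))   # Sunday: True
```

The same 30 users are unremarkable on a Tuesday afternoon and suspicious on a Sunday, which is exactly the distinction a single fixed threshold cannot make.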

It’s always been said that open source is driven by someone “scratching an itch.” This sounds like mine…

Preventing Problems (or: How to Appear Omniscient to Your Users!)

When a user comes to you with problems that they are experiencing with one of the servers you manage, what is the first thing that goes through your mind (aside from “How may I help you?”)? For me, there are two things: “How can I prevent this from happening again?” and “Why didn’t I know about this already?”

Let us focus on the second of these. If a user is experiencing problems, you should already know – yes, you really should. If the server is down, overloaded, or lagging behind, these are the sorts of things you should already know.

Most servers leave messages in the system syslog or other log files; write or use something that will scan the log files for appropriate entries and send you a warning. SEC (Simple Event Correlator) is one of the best at this.
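If you do decide to write something yourself, the core of it can be quite small. The Python below is a bare-bones sketch of the idea, not a substitute for SEC: the patterns and the sample log lines are hypothetical, and sending the actual warning (mail, pager, whatever you use) is left out.

```python
import re
from typing import Iterable, Iterator, Sequence, Tuple

# Hypothetical patterns; tune them to what your own servers actually log.
ALERT_PATTERNS = [
    re.compile(r"out of memory", re.IGNORECASE),
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"authentication failure", re.IGNORECASE),
]


def find_alerts(lines: Iterable[str],
                patterns: Sequence[re.Pattern] = ALERT_PATTERNS
                ) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line matching any pattern."""
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in patterns):
            yield number, line.rstrip("\n")


# Made-up log lines; in practice you would pass an open log file instead.
sample_log = [
    "Jan  5 03:12:01 web01 kernel: Out of memory: Kill process 4242 (httpd)",
    "Jan  5 03:12:05 web01 sshd[991]: Accepted publickey for alice",
    "Jan  5 03:13:40 web01 kernel: end_request: I/O error, dev sda",
]

for number, line in find_alerts(sample_log):
    print(f"line {number}: {line}")
```

Because find_alerts accepts any iterable of strings, an open file object works just as well as a list.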

Another tool that is invaluable for this is Nagios or other monitoring software such as Zabbix or Zenoss. With such software, it is possible to be notified when a particular event occurs or a threshold is actually crossed.

When a tool like Nagios is combined with SEC, then much more powerful reporting is available. For example, if a normally benign error (ugh! Who said errors were normal?) occurs too many times in a period of time, then the error can be reported to the Nagios monitoring software and someone notified.
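The “too many times in a period of time” rule is easy to picture in code. The Python sketch below counts events in a sliding window and, once the limit is exceeded, builds a passive check result in the form Nagios accepts through its external command file (PROCESS_SERVICE_CHECK_RESULT, with 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). It assumes passive checks are enabled on your Nagios server; the class name, host name, service name, and limits are all invented for the example, and it does not show how SEC itself is configured.

```python
from collections import deque
from typing import Deque


class EventRateWatcher:
    """Counts events in a sliding time window."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._times: Deque[float] = deque()

    def observe(self, timestamp: float) -> bool:
        """Record one event; return True if the window now exceeds the limit."""
        self._times.append(timestamp)
        # Forget events that have fallen out of the window.
        while timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.max_events


def passive_check_line(now: float, host: str, service: str,
                       state: int, message: str) -> str:
    """Format one passive service check result for the Nagios command file."""
    return (f"[{int(now)}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{state};{message}")


# Hypothetical rule: more than 3 "benign" errors within 60 seconds is a WARNING.
watcher = EventRateWatcher(max_events=3, window_seconds=60.0)
for when in (0.0, 10.0, 20.0, 30.0):
    if watcher.observe(when):
        print(passive_check_line(when, "web01", "benign_errors", 1,
                                 "more than 3 errors in 60s"))
```

In a real setup the printed line would be written to the Nagios command file rather than to the screen; where that file lives depends on your installation.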

Other tools provide system monitoring with time-related analysis. For example, if disk utilization is too high for too long, a warning can be issued. Another example: if too many CPUs average more than 60% utilization over the last 30 seconds, someone could be notified.
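The CPU rule can be spelled out in a few lines of Python. This sketch does not show how any particular tool expresses such a rule; it just works on a plain dictionary of per-CPU readings, and the function names, CPU labels, and the “how many is too many” limit are assumptions for the sake of the example.

```python
from statistics import mean
from typing import Dict, List, Tuple

Readings = Dict[str, List[Tuple[float, float]]]  # cpu -> [(timestamp, percent)]


def busy_cpus(samples: Readings, now: float,
              window_seconds: float = 30.0,
              limit_percent: float = 60.0) -> List[str]:
    """Return the CPUs whose average utilization over the window exceeds the limit."""
    busy = []
    for cpu, readings in samples.items():
        recent = [pct for t, pct in readings if 0 <= now - t <= window_seconds]
        if recent and mean(recent) > limit_percent:
            busy.append(cpu)
    return sorted(busy)


def should_notify(samples: Readings, now: float, max_busy: int) -> bool:
    """True when more than max_busy CPUs are over the limit."""
    return len(busy_cpus(samples, now)) > max_busy


# Hypothetical readings taken every 10 seconds on a three-CPU machine.
samples = {
    "cpu0": [(t, 90.0) for t in (0.0, 10.0, 20.0, 30.0)],
    "cpu1": [(t, 20.0) for t in (0.0, 10.0, 20.0, 30.0)],
    "cpu2": [(0.0, 10.0), (10.0, 80.0), (20.0, 90.0), (30.0, 100.0)],
}

print(busy_cpus(samples, now=30.0))                  # ['cpu0', 'cpu2']
print(should_notify(samples, now=30.0, max_busy=1))  # True
```

Note that cpu2 starts out nearly idle; it is flagged because its 30-second average (70%) is over the limit, not because of any single reading.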

HP’s GlancePlus (a part of OpenView, which comes bundled with HP-UX 11i v3) and the now open source tool Performance Co-Pilot (or PCP) from SGI are two that provide these capabilities. They support averaging, counts per minute, and much more. PCP comes with support for remote monitoring, so all systems can be monitored (and data archived) in a central location.

Again, these tools can be integrated with SEC or Nagios to send out notifications or post outage notices and so forth.

With tools like these in your arsenal, the next time someone comes to you with an outage or a complaint about sluggish performance, your response can be: “Yes, I’m already working on it.” Your users will think you omniscient!