What’s Wrong with Nagios? (and Monitoring)

Nagios (or its new offspring, Icinga) is the king of open source monitoring, and there are others like it. So what’s wrong with monitoring? Why does it bug me so?

Nagios is not the complete monitoring solution that many think it is, because it can only mark the passing of a threshold. There are basically only two states: good and not good (ignoring “warning” and “unknown” for now).

What monitoring needs is two things: a) time, and b) flexibility.

Time is the ability to look at the change in a process or value over time. Disk I/O might or might not be high – but has it been high for the last twenty minutes or is it just on a peak? Has disk usage been slowly increasing or did it skyrocket in the last minute? This capability can be provided by tools like the Simple Event Correlator (SEC). The biggest problem with SEC is that it runs by scanning logs; if something isn’t logged, SEC won’t see it.
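To make the "high for the last twenty minutes or just on a peak" distinction concrete, here is a minimal sketch of a check that only fires when every sample in a sliding window is over the limit. The class name, the limit, and the window size are all made up for illustration; this is not how Nagios or SEC work internally.

```python
from collections import deque


class SustainedThreshold:
    """Fire only when every sample in a sliding window exceeds the limit."""

    def __init__(self, limit, window):
        self.limit = limit
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def add(self, value):
        """Record one sample; return True if the window is full and all of it is over the limit."""
        self.samples.append(value)
        return (len(self.samples) == self.samples.maxlen
                and all(v > self.limit for v in self.samples))


# A single peak (95) never fills the window with high values, so nothing fires.
check = SustainedThreshold(limit=80.0, window=4)
spike = [check.add(v) for v in [10, 95, 12, 11, 13]]
```

With one sample per five minutes, a window of 4 would correspond to the "last twenty minutes" in the paragraph above.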

The second thing is what drove me to write: there is no flexibility in these good/not-good checks that Nagios and its ilk provide. There is also not enough flexibility in SEC and others like it.

What is needed is a pattern recognition system – one that says, this load is not like the others that the system has experienced at this time in the past. If you look at a chart of system load on an average server (with users or developers on it) you’ll see that the load rises in the morning and decreases at closing time. When using Nagios, the load is either good or bad – with a single value. Yet a moderately heavy load could be a danger sign at 3 a.m. but not at 11 a.m. Likewise, having 30 users logged in is not a problem at 3 p.m. on a Tuesday – but could be a big problem at 3 p.m. on a Sunday.

What we really need is a learning system that can match the current information from the system with recorded information from the past – matched by time.
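As a rough sketch of what "matched by time" could mean, the snippet below keeps past values in buckets keyed by weekday and hour, and flags a new value when it falls too far from the history of its own bucket. Everything here (the class, the bucket size, the `tolerance` and `min_spread` parameters) is an assumption chosen for illustration, not an existing tool.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def slot(ts):
    """Map a datetime to an hour-of-week bucket: (weekday, hour)."""
    return (ts.weekday(), ts.hour)


class TimeMatchedBaseline:
    """Remember past values per hour-of-week and compare new ones against them."""

    def __init__(self):
        self.history = defaultdict(list)

    def learn(self, ts, value):
        self.history[slot(ts)].append(value)

    def is_anomalous(self, ts, value, tolerance=3.0, min_samples=3, min_spread=0.5):
        past = self.history.get(slot(ts), [])
        if len(past) < min_samples:
            return False  # not enough history to judge this time slot
        # min_spread keeps a very quiet history from flagging tiny changes
        spread = max(pstdev(past), min_spread)
        return abs(value - mean(past)) > tolerance * spread


baseline = TimeMatchedBaseline()
# Four Tuesdays of (hypothetical) load data: busy at 11 a.m., idle at 3 a.m.
for day, busy, idle in [(2, 4.0, 0.2), (9, 4.2, 0.3), (16, 3.8, 0.25), (23, 4.1, 0.2)]:
    baseline.learn(datetime(2024, 1, day, 11), busy)
    baseline.learn(datetime(2024, 1, day, 3), idle)

normal_at_11 = baseline.is_anomalous(datetime(2024, 1, 30, 11), 4.0)
odd_at_3 = baseline.is_anomalous(datetime(2024, 1, 30, 3), 4.0)
```

The same load of 4.0 is unremarkable at 11 a.m. but flagged at 3 a.m., which is exactly the distinction a single fixed threshold cannot make.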

It’s always been said that open source is driven by someone “scratching an itch.” This sounds like mine…

18 thoughts on “What’s Wrong with Nagios? (and Monitoring)”

  1. I’m not sure how nagios implements node monitoring but wouldn’t the stuff you talk about be implemented via snmp traps and with logic you can configure there?

    1. I don’t believe so. Aren’t SNMP traps triggered in the same way?

      In any case, Nagios support for SNMP traps is not simple or native – at least, that’s the way it’s been. It certainly doesn’t come included with Nagios and it doesn’t come included with Net-SNMP.

  2. Zabbix should be able to handle most of what you mention. It has multiple alert levels and you can define trigger dependencies so that trigger A only becomes active when its dependency trigger B is also active.

  3. You should have a look at Check_MK. Some checks delivered with it (like the interface checks if, if64 and the disk usage checks df) provide trend analysis and average calculations.

    The problem with analyzing trends and average rates is that there is no generic pattern which can be applied to all checks. So one needs to think it through and decide for each check whether and how to implement such things.

    There are several other extensions and addons which can improve Nagios in the direction you like. For example in PNP4Nagios you can realize some sort of pattern recognition (e.g. Holt-Winters: http://www.nagios-portal.org/wbb/index.php?page=Thread&threadID=8297).

    1. We’re using PNP4Nagios, but I’ve not seen any way to alarm based on time data out of PNP4Nagios – it’s a graphing tool, not an analyzer.

      1. That is why I pointed you to Check_MK. It provides trend data and pushes them to Nagios/PNP4Nagios.

  4. Couldn’t agree more.
    Many years ago, my company implemented BMC Patrol monitoring and I told the monitoring team
    I didn’t care if a filesystem went from 90% to 91%; I needed to know HOW FAST IT WAS GROWING.
    Also, % is meaningless when you have very large filesystems…

    1. Exactly. Free space of 5% of a 500M drive would be just fine, but 5% of a 160G drive is just too much. Then again, 5% of a 20M drive isn’t enough…
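For what it's worth, "how fast is it growing" is cheap to estimate once you keep a few samples. The sketch below fits a straight line through (timestamp, bytes used) pairs and projects when the filesystem fills up. The function names and the linear-growth assumption are mine, for illustration only.

```python
def growth_rate(samples):
    """Least-squares slope of used bytes over time, in bytes per second.

    `samples` is a list of (unix_timestamp, used_bytes) pairs.
    """
    n = len(samples)
    t_mean = sum(t for t, _ in samples) / n
    u_mean = sum(u for _, u in samples) / n
    den = sum((t - t_mean) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("need at least two samples with different timestamps")
    num = sum((t - t_mean) * (u - u_mean) for t, u in samples)
    return num / den


def seconds_until_full(samples, capacity):
    """Projected seconds until `capacity` is reached, or None if usage is not growing."""
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    _, used_now = samples[-1]
    return (capacity - used_now) / rate


# Usage grows by 60 bytes per minute, i.e. 1 byte per second.
eta = seconds_until_full([(0, 100), (60, 160), (120, 220)], capacity=1000)
```

Alerting on "full in less than N hours" sidesteps the percentage problem, since it behaves the same on a 20M drive and a 160G one.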

  5. It’s not like Nagios can’t do that, but what you need are plugins that support measuring over time. There’s no inherent historic statistic storage in Nagios, but there are lots of solutions for storing check results (including performance data) in databases, which would be simple to pull from.

  6. There is a lot of discussion about programs that can handle time-based alarms and notifications, but none of these options will handle pattern recognition – and none have a learning capability.

    The ideal system would have an agent that runs on the system and learns over several weeks the typical workload on the system – storing data points in a database perhaps. It would then continue to record the data points (for other analysis) but would create a pattern to match against.

    This would allow alarms based on responses like “The current system load is 20% larger than normal for a Thursday afternoon.” and “The current disk I/O is 50% higher than normal on that disk on a Saturday morning.”

    This pattern recognition could be combined with time-based alarms too: “The current system load is rising 50% faster than normal for Sunday night.”

    I don’t think there’s anything out there that can do this now.

  7. Thanks for the great post. Even though this thread is old, the point of view is still valid.

    In essence you are interested in pattern recognition which may fit a variety of cases. This pattern recognition would drive the creation of meaningful events that would be interesting to operational or business personnel.

    First, this requires a time component and a series of values to work with.
    Second, there needs to be an analysis framework:
    – Anomaly detection (e.g. Holt-Winters)
    – Trend directionality (e.g. crossed a threshold and getting worse or getting better)
    – Data relations (e.g. applies to a single input or also to a related group of inputs)
    – etc.

    Third is a consistent method. You and one of the commenters hinted at this. The quality of the analysis depends on each check plugin. This equals Plugin Hell. This does not scale and is impossible to maintain.

    The proposed solution is using a daemonized process to query a time-series database that provides access methods to prepare the data for analysis. 1 gold piece to the one who guesses which time-series database I am referring to.

    .. Graphite ..

    Graphite provides the time component and time-series.
    Graphite provides the basic functions for analysis (Holt-Winters forecasting, derivative, integral, mostDeviant, movingAverage, timeShift, etc.). There are dozens and dozens of functions.
    Graphite permits applying functions to groups of time-series (ex. datacenter1.servers.*.cpu)
    Graphite provides a simple access method (URL API) and efficient data return (JSON)
    Graphite supports load-balanced, HA databases supporting update rates in excess of anything an SQL database can support (80K updates per second using caching to optimize disk writes).

    Another gold piece to the one who can guess which Nagios inspired solution would implement this.

    .. Shinken ..

    Shinken provides the facilities for daemonized checks, called poller modules. (A number of them already exist.)
    Shinken provides the architecture for distributed daemonized checks.
    Shinken provides the architecture for HA distributed daemonized checks
    Shinken integrates the exporting of data to Graphite very well
    Shinken integrates the display of data from Graphite directly in the Shinken WebUI
    Shinken supports templates for displaying data from Graphite

    The framework is there. Shinken and Graphite provide an innovative real world solution available TODAY.

    The complete workflow looks like this :

    1 Data acquisition and normal thresholding by Shinken
    2 Data is processed by the event management system of Shinken
    3 Data exported real time to Graphite for storage into one or more time series databases (Whisper)
    4 Daemonized PatternRecognition poller module queries Graphite via its URL API
    5 Data returned by Graphite is analyzed and sent on to be processed by the event management framework
    2.5, 5.5 Notifications are sent to personnel as appropriate
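Steps 4 and 5 of the workflow above could look roughly like the sketch below: build a render URL, read the JSON that Graphite returns (a list of series, each with a `target` and `datapoints` as `[value, timestamp]` pairs), and pick out the series whose average over the window is too high. This is not an actual Shinken poller module; the host name, function names, and limit are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(base, target, window="-10min"):
    """Build a Graphite render URL that returns JSON for the given window."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base}/render?{query}"


def parse_series(body):
    """Return {target: [values]} from a Graphite JSON response, skipping null datapoints."""
    return {
        series["target"]: [v for v, _ts in series["datapoints"] if v is not None]
        for series in json.loads(body)
    }


def fetch(base, target, window="-10min"):
    """Query a live Graphite server (needs network access; host is up to you)."""
    with urlopen(render_url(base, target, window), timeout=10) as resp:
        return parse_series(resp.read().decode("utf-8"))


def breaches(series, limit):
    """Targets whose mean over the window exceeds the limit."""
    return sorted(t for t, vals in series.items()
                  if vals and sum(vals) / len(vals) > limit)


# Hypothetical response body, shaped like Graphite's JSON output.
sample = ('[{"target": "web01.load", "datapoints": [[1.0, 10], [null, 20], [3.0, 30]]},'
          ' {"target": "web02.load", "datapoints": [[9.0, 10], [11.0, 20]]}]')
over_limit = breaches(parse_series(sample), limit=5.0)
```

In a real poller the result of `breaches()` would be handed back to the event management framework rather than stored in a variable.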

    Do not mistake this for event stream processing or CEP processing. This is the closest you can get to that without the crazy complexity or the need for a statistics major.

    And a final gold piece to anyone who knows without googling what the Mad Overlord refers to. 🙂 Think grid paper and a monochrome screen… Ouch, now I feel like a geezer.

    I hope this gets your juices going. This is one of the reasons why I got onto the Shinken bandwagon. Sure it is a young project, but the innovation packed into this monitoring system gives us hope that Nagios hell will be a distant memory, while still making use of all that built up knowledge.

    1. Sounds interesting, though I’m not sure how it is an improvement over PCP or HP Measureware. I don’t see where it provides time-based alerting (such as “WARNING: Disk I/O excessive for the last 10 minutes”.) Beyond that, there still really isn’t anything that provides alarms like “WARNING: anomalous and unexpected behavior for this system: load over 20”.

      Shinken has a home page, as does Graphite. Neither is available from the Ubuntu repositories. Shinken appears to be compatible with Nagios, which makes it a competitor to Icinga as well. Icinga was created when major Nagios developers couldn’t get updates into Nagios; Shinken was created when a rewrite of Nagios was rejected.

      Graphite appears to be a competitor to PNP4Nagios, but I’m not sure.

      I think this demands further investigation and another post or two.

      1. I will let you decide whether it is a technical improvement (though most probably not a usability one, as the projects are open source):
        Graphite’s Whisper database is a true time-series database. Fixed sized, automatic aggregation to less granular form, supports up to 80K updates per second, 1 million datapoints and more. And access time does not degrade as the databases fill up. Try that with any SQL based product. Can be backed up, rsynced, databases for the same time series can be merged during runtime and the storage and aggregation algorithms can easily be modified/extended.

        The analysis Shinken module does not exist yet, but everything is there to support it. It certainly has piqued my interest and I am putting my support behind it in word and deed.

        I think you will find it interesting. It certainly warrants further testing.

        There are also projects existing and upcoming that mix RabbitMQ, ESPER, Shinken and Graphite; You could also think of workflows like Shinken + Splunk + Graphite.

        Graphite provides a URL API.
        This API permits the user to get data back in the form of JSON, pickle, PNG, csv, etc.
        For the purpose of a plugin, JSON is the safest and probably the fastest.

        This API lets you apply a function over a time frame, ex. the last 10 minutes.
        You can then apply a second function (in the same call) and ask, for example, whether the average for this time frame is above a fixed threshold.

        You could also do the same but instead of applying it to one server it could be all servers of a cluster, using a * wildcard in the time series name.

        Getting closer to anomaly detection, you can compare multiple timeframes, or you could have those time series on acquisition run through the statistical anomaly detection algorithm Holt-Winters which analyses short and mid term seasonality with confidence bands (upper and lower).
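Comparing time frames can be sketched in a few lines: request the same series twice, once as-is and once wrapped in Graphite's `timeShift()` function, then compare the two averages. The helper names and the one-week offset are assumptions for illustration; the values would come from the JSON returned by the URL API.

```python
def shifted(target, offset="7d"):
    """Wrap a target in Graphite's timeShift() so it returns data from `offset` earlier."""
    return f'timeShift({target},"{offset}")'


def percent_over(current, reference):
    """How far the mean of `current` exceeds the mean of `reference`, in percent.

    None values (gaps in the series) are ignored; returns None when there is
    nothing to compare against.
    """
    cur = [v for v in current if v is not None]
    ref = [v for v in reference if v is not None]
    if not cur or not ref:
        return None
    ref_mean = sum(ref) / len(ref)
    if ref_mean == 0:
        return None
    return 100.0 * (sum(cur) / len(cur) - ref_mean) / ref_mean


last_week_target = shifted("servers.web01.load")
# Made-up values: this week's window versus the same window one week ago.
delta = percent_over([6.0, 6.0], [5.0, None, 5.0])
```

A result of 20 here would read as "the current load is 20% larger than at this time last week".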

        Look at the list of functions you can use in the Graphite documentation on Readthedocs for version 0.9.10, the current stable release.

        Graphite could be compared to a mix of RRDtool, RRDcache with a web API, rendering frontend with HA and load-balancing that can also store exact time-series (no interpolation like RRDtool). Graphite can also act as a front-end for RRDtool databases.

        PNP4Nagios is a front-end to RRDtool with templating, a very convenient one and well integrated to Nagios, but that’s where it stops.

        Is Shinken + Graphite the end-all? Not really, but it is a very modern and forward-looking, truly open-source, complementary solution. Nothing like the two-headed monster of Nagios+Cacti.

        Shinken is an alternative to Nagios and all of its derived cores (notably Icinga and Opsview). The new Python code base is a fraction of the size of Nagios with a lot more functionality, but it does suffer from the project’s youth. There are bugs in the least used features, or those that lack unit tests, but they get corrected as they are found.

        Have a good one.

    1. Actually, yes – in part. There’s still nothing that will provide “over time” analysis the same way that PCP and Measureware do – but many Nagios developers left and have created Icinga, which implements a new user interface (both old and new) and incorporates PNP4Nagios directly and many other new features. Icinga is now a part of Ubuntu’s repositories (Main I think) and is almost certainly in Red Hat’s repositories as well.
