Preventing Problems (or: How to Appear Omniscient to Your Users!)

26 December 2008

When a user comes to you with a problem on one of the servers you manage, what is the first thing that goes through your mind (aside from “How may I help you?”)? For me, there are two: first, “How can I prevent this from happening again?” and second, “Why didn’t I know about this already?”

Let us focus on the second of these. If a user is experiencing problems, you should already know – yes, you really should. If the server is down, overloaded, or lagging behind, you should know before anyone has to tell you.

Most servers leave messages in syslog or other log files; write or use something that will scan those files for the relevant entries and send you a warning. SEC (Simple Event Correlator) is one of the best tools for this.
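To make the idea of scanning logs concrete, here is a minimal sketch in Python. It is not SEC and does not use SEC's rule syntax; it only shows the basic loop of matching log lines against patterns. The pattern labels and regular expressions are hypothetical examples and should be replaced with the messages your own servers emit.

```python
import re

# Hypothetical patterns for illustration; tune these to your own logs.
PATTERNS = {
    "disk": re.compile(r"I/O error|No space left on device"),
    "auth": re.compile(r"authentication failure", re.IGNORECASE),
}


def scan_lines(lines, patterns=PATTERNS):
    """Return (label, line) pairs for every line that matches a pattern."""
    hits = []
    for line in lines:
        for label, pattern in patterns.items():
            if pattern.search(line):
                hits.append((label, line.rstrip("\n")))
                break  # report each line once, under the first matching label
    return hits
```

Feeding it an open log file (for example, `scan_lines(open(path))` for whatever path your syslog writes to) yields the entries worth a warning; how the warning is delivered (mail, pager, monitoring system) is left to the site.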

Another invaluable tool is Nagios, or similar monitoring software such as Zabbix or Zenoss. With such software, you can be notified when a particular event occurs or a threshold is passed.
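A threshold check of this kind can be sketched in a few lines of Python. The sketch follows the common Nagios plugin convention of one line of output plus an exit status of 0 (OK), 1 (WARNING), or 2 (CRITICAL). The function names, the default path, and the 80%/90% thresholds are assumptions chosen for illustration, not recommendations.

```python
import shutil

# Exit statuses used by the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_threshold(value, warn, crit):
    """Map a measured value onto a status using two thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def disk_percent_used(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def main(path="/", warn=80.0, crit=90.0):
    """Print a one-line status and return the status code."""
    used = disk_percent_used(path)
    status = check_threshold(used, warn, crit)
    print(f"DISK {STATUS_NAMES[status]} - {used:.1f}% used on {path}")
    return status
```

Run as a plugin, the script would finish with `sys.exit(main())` so that the monitoring system sees the status code.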

When a tool like Nagios is combined with SEC, much more powerful reporting is available. For example, if a normally benign error (ugh! Who said errors were normal?) occurs too many times within a given period, the error can be reported to the Nagios monitoring software and someone notified.
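The "too many times in a period" rule can be sketched as a sliding-window counter. This is a stand-in written in Python, not SEC's own rule language. The class name, limits, host name, and service name are hypothetical. The second function formats a line in the form Nagios accepts as a passive service check result through its external command file; where that file lives, and whether passive checks are enabled, depends on the installation.

```python
from collections import deque


class RateDetector:
    """Flag an event that occurs more than max_events times in a window."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.times = deque()

    def record(self, timestamp):
        """Record one occurrence; return True if the limit is now exceeded."""
        self.times.append(timestamp)
        # Forget occurrences that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.times and self.times[0] <= cutoff:
            self.times.popleft()
        return len(self.times) > self.max_events


def passive_check_line(timestamp, host, service, status, message):
    """Format a passive service check result for Nagios."""
    return (f"[{int(timestamp)}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{status};{message}")
```

Each time the log scanner sees the benign error it calls `record()` with the current time; only when `record()` returns True is a line built with `passive_check_line()` and handed to Nagios, so a single stray error stays quiet.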

Other tools provide system monitoring with time-related analysis. For example, if disk utilization is too high for too long, a warning can be issued. Another example: if too many CPUs have averaged more than 60% utilization over the last 30 seconds, someone could be notified.
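The CPU example can be written out as a short Python sketch. It assumes the utilization samples for the last 30 seconds have already been collected per CPU by some other means; the function names and the `max_busy` limit are hypothetical, and this is not how GlancePlus or PCP express such rules.

```python
from statistics import mean


def busy_cpus(samples_by_cpu, threshold=60.0):
    """Return the ids of CPUs whose mean utilization exceeds threshold.

    samples_by_cpu maps a CPU id to the utilization percentages
    sampled during the window (for example, the last 30 seconds).
    """
    return sorted(
        cpu for cpu, samples in samples_by_cpu.items()
        if samples and mean(samples) > threshold
    )


def should_notify(samples_by_cpu, threshold=60.0, max_busy=1):
    """True when more than max_busy CPUs are over the threshold."""
    return len(busy_cpus(samples_by_cpu, threshold)) > max_busy
```

Averaging over the window is what keeps a momentary spike from paging anyone: a CPU that touches 95% in one sample but idles the rest of the time does not count as busy.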

HP’s GlancePlus (a part of OpenView, which comes bundled with HP-UX 11i v3) and the now open source tool Performance Co-Pilot (PCP) from SGI are two that provide these capabilities. They support averaging, counts per minute, and much more. PCP comes with support for remote monitoring, so all systems can be monitored (and data archived) in a central location.

Again, these tools can be integrated with SEC or Nagios to send out notifications or post outage notices and so forth.

With tools like these in your arsenal, the next time someone comes to you complaining of an outage or sluggish performance, your response can be: “Yes, I’m already working on it.” Your users will think you omniscient!

Entry Filed under: Customer Service, Monitoring, Performance.



David Douthitt

David is an experienced UNIX and Linux system administrator, a former Linux distribution maintainer, and author of two books ("Advanced Topics in System Administration" and "GNU Screen: A Comprehensive Manual").
