Nagios System and Network Monitoring

Posted by samzenpus on Wednesday April 11, 2007 @03:22PM from the keep-an-eye-on-things dept.

David Martinjak writes "Nagios is an open source application for monitoring hosts, services, and conditions over a network. Availability of daemons and services can be tested, and specific statistics can be checked by Nagios to provide system and network administrators with vital information to help sustain uptime and prevent outages. Nagios: System and Network Monitoring is for everyone who has a network to run." Read on for the rest of the review.

Nagios: System and Network Monitoring
author	Wolfgang Barth
pages	464
publisher	No Starch Press
rating	9
reviewer	David Martinjak
ISBN	1593270704
summary	Covers installing, configuring, and deploying Nagios to monitor systems and services on a network.

The book is authored by Wolfgang Barth and published by No Starch Press. The publisher hosts a Web page which contains an online copy of the table of contents, portions of reviews, links to purchase the electronic and print versions of the book, and a sample chapter ("Chapter 7: Testing Local Resources") in PDF format.

An amusing note to begin: this is one of the only books I have read where the introduction was actually worth reading closely. Many books talk about the background or history of the subject without providing much pertinent information, if any at all. In Nagios: System and Network Monitoring, Wolfgang Barth begins with a hypothetical anecdote to illustrate the usefulness of Nagios. The most important section in the introduction, however, is the explanation of states in Nagios. While monitoring a resource, Nagios will return one of four states. OK indicates nominal status, WARNING shows a potentially problematic circumstance, CRITICAL signifies an emergency situation, and UNKNOWN usually means there is an operating error with Nagios or the corresponding plugin. What counts as WARNING or CRITICAL is decided by the person or team who administers Nagios, who sets the relevant thresholds for those two status levels.
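For readers who have not written a plugin: a Nagios plugin reports its state through its exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and prints a one-line status message. The following Python sketch, which is not from the book, shows how a pair of administrator-chosen thresholds might map a measured value onto those states. It assumes larger values are worse and uses plain numbers; the function name `classify` is hypothetical, and real plugins accept a richer threshold range syntax.

```python
from enum import IntEnum
from typing import Optional


class State(IntEnum):
    """Plugin exit codes as Nagios interprets them."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def classify(value: Optional[float], warn: float, crit: float) -> State:
    """Map a measurement onto a state, assuming larger values are worse.

    `value` is None when the measurement itself failed, which is an
    operating error (UNKNOWN) rather than a problem with the resource.
    """
    if value is None or warn > crit:
        # No reading, or inconsistent thresholds: report UNKNOWN, don't guess.
        return State.UNKNOWN
    if value >= crit:
        return State.CRITICAL
    if value >= warn:
        return State.WARNING
    return State.OK


if __name__ == "__main__":
    load = 3.7  # hypothetical measured value
    state = classify(load, warn=4.0, crit=8.0)
    # A real plugin would print this line and then exit with int(state).
    print(f"LOAD {state.name} - load average {load}")
```

Keeping UNKNOWN for "could not measure" separate from CRITICAL matches the distinction the book draws: UNKNOWN points at the monitoring setup, not at the monitored resource.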

The first chapter walks the reader through installing Nagios on the filesystem. All steps are shown, which proves very helpful if you are unfamiliar with unpacking archives or compiling from source. Users who are new to Linux, or who cannot install Nagios through a package manager, will appreciate the verbosity offered here. Fortunately, the level of detail is consistent throughout the book.

Chapter 2 explains the configuration structure of Nagios. This chapter may contain the most important material in the book, as understanding the layout of Nagios is essential to a successful deployment in any environment. The book moves right into enumerating the uses and purposes of the config files, objects, groupings, and templates. All of this information is valuable and presented in a descriptive manner to help the reader set up a properly configured installation of Nagios. My biggest stumbling block in using Nagios was wrapping my brain around the relationships between the config files and objects. This chapter clears up all of the ambiguities I remember having to work out for myself. If only this book had been around a few years ago!
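To make the idea of templates concrete, here is a rough Python sketch, not taken from the book, of how template inheritance resolves: an object names a template with `use`, a template identifies itself with `name`, and `register 0` marks it as a template rather than a real object. Directives set on the object win over inherited ones. The host name, address, and template contents are made up for illustration, and this sketch follows only a single parent template; later Nagios releases also accept several comma-separated templates in `use`, which it does not attempt to handle.

```python
from typing import Dict, Optional, Set

# A Nagios object definition modeled as directive -> value. This mirrors the
# text form "define host { use generic-host ... }" but is not a parser for it.
Definition = Dict[str, str]

# Bookkeeping directives that control inheritance rather than describe the object.
INHERITANCE_KEYS = ("use", "name", "register")


def resolve(obj: Definition, templates: Dict[str, Definition],
            _seen: Optional[Set[str]] = None) -> Definition:
    """Merge obj over the template chain named by its `use` directive.

    Directives set on the object itself override inherited ones. Only a
    single parent template is supported in this sketch.
    """
    seen: Set[str] = set() if _seen is None else _seen
    merged: Definition = {}
    parent = obj.get("use")
    if parent is not None:
        if parent in seen:
            raise ValueError(f"template loop through {parent!r}")
        seen.add(parent)
        merged.update(resolve(templates[parent], templates, seen))
    merged.update(obj)
    for key in INHERITANCE_KEYS:
        merged.pop(key, None)
    return merged


def render(kind: str, obj: Definition) -> str:
    """Format a definition in Nagios's object-definition text style."""
    body = "\n".join(f"    {key:<24}{value}" for key, value in obj.items())
    return f"define {kind} {{\n{body}\n}}"


if __name__ == "__main__":
    templates = {
        "generic-host": {
            "name": "generic-host",
            "register": "0",  # a template, not a monitored host
            "max_check_attempts": "3",
            "notification_interval": "60",
        },
    }
    ldap01 = {
        "use": "generic-host",
        "host_name": "ldap01",
        "address": "192.0.2.10",
        "notification_interval": "120",  # overrides the template's 60
    }
    print(render("host", resolve(ldap01, templates)))
```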

The sixth chapter dives into the details of the plugins available for monitoring network services. It explains how to use the check_icmp plugin to verify reachability by pinging both a host and a specific service. Additional examples include monitoring mail servers, LDAP, web servers, and DNS, among others. There is even a section on testing TCP and UDP ports.
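The sketch below, a hypothetical illustration rather than the real check_tcp or check_icmp plugin, shows the general shape of a TCP port test: attempt a connection, time it, and map the response time (or a failed connection) onto the Nagios states. Python's `socket.create_connection` stands in for the plugin's own networking code, and the function name and default thresholds are assumptions. ICMP is left out because sending raw ICMP packets normally requires root privileges.

```python
import socket
import time
from typing import Tuple

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp_port(host: str, port: int, warn_s: float = 1.0,
                   crit_s: float = 5.0) -> Tuple[int, str]:
    """Return (state, message) for a TCP connect test against host:port."""
    start = time.monotonic()
    try:
        # The connection attempt itself gives up after crit_s seconds.
        with socket.create_connection((host, port), timeout=crit_s):
            pass
    except OSError as exc:  # refused, unreachable, or timed out
        return CRITICAL, f"TCP CRITICAL - {host}:{port}: {exc}"
    elapsed = time.monotonic() - start
    if elapsed >= crit_s:
        state, label = CRITICAL, "CRITICAL"
    elif elapsed >= warn_s:
        state, label = WARNING, "WARNING"
    else:
        state, label = OK, "OK"
    return state, f"TCP {label} - {elapsed:.3f}s response time on {host}:{port}"
```

Calling `check_tcp_port("192.0.2.25", 25)` would, for example, test an SMTP port; a real plugin would print the message and exit with the returned state.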

Next, the book covers checking the status of local resources on systems. At work, we have a system in production that could have been partitioned better. Unfortunately, /var is a bit smaller than it should be, and tends to fill up relatively frequently. Thankfully, Nagios can trigger a warning when free space on the partition runs low. From there, we have Nagios execute a script that cleans out certain items in /var so we don't have to bother with it. We can also receive a notification if the situation does not improve and requires further attention. In addition to monitoring hard drive usage, the book includes examples for checking swap utilization, system load, number of logged-in users, and even Nagios itself.
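A free-space test like the one described boils down to comparing the percentage of free space with two thresholds, where lower values are worse. The following Python sketch is an illustration, not the book's check_disk configuration; it uses `shutil.disk_usage`, and the function names and the 20% and 10% free limits for WARNING and CRITICAL are assumptions. The automatic cleanup itself would be hooked up separately; in Nagios that kind of reaction is normally configured as an event handler.

```python
import shutil
from typing import Tuple

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


def disk_state(free_bytes: int, total_bytes: int,
               warn_pct_free: float = 20.0,
               crit_pct_free: float = 10.0) -> Tuple[int, float]:
    """Classify free space; lower is worse, so thresholds compare with <=."""
    if total_bytes <= 0:
        return UNKNOWN, 0.0
    pct_free = 100.0 * free_bytes / total_bytes
    if pct_free <= crit_pct_free:
        return CRITICAL, pct_free
    if pct_free <= warn_pct_free:
        return WARNING, pct_free
    return OK, pct_free


def check_partition(path: str = "/var", **thresholds: float) -> Tuple[int, str]:
    """Return (state, message) for the filesystem containing `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Could not even measure: an operating error, not a full disk.
        return UNKNOWN, f"DISK UNKNOWN - cannot stat {path}: {exc}"
    state, pct = disk_state(usage.free, usage.total, **thresholds)
    return state, f"DISK {LABELS[state]} - {path} {pct:.1f}% free"
```

Separating the pure `disk_state` decision from the system call keeps the threshold logic easy to test on its own.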

Chapter 12 discusses the notification system in Nagios. You provide the who, what, when, where, and how in the configs, and Nagios does the rest. The book does a fantastic job of explaining what exactly triggers a notification, and how to configure Nagios efficiently so that the proper parties are informed of relevant issues at reasonable intervals. For example, the server team might be interested to know that /var is 90% full on one of the LDAP servers; however, they don't need to be notified of this every thirty seconds. This chapter also covers an important aspect of Nagios known as flapping. Flapping occurs when a monitored resource quickly alternates between states. Nagios can be configured with a certain tolerance for rapid alternating changes in state, which means Nagios won't sound the alarm if the problem resolves itself within a short period of time. Flapping is usually caused by an external factor temporarily influencing the results of the test from Nagios, and therefore has no long-term impact.
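Nagios decides whether a resource is flapping by looking at how often its state changed over its recent check results, weighting recent changes more heavily, and by using separate low and high thresholds so that it does not toggle in and out of the flapping state itself. The Python class below is a simplified sketch of that idea under stated assumptions (a 21-result window and linear weights from 0.8 to 1.2); the class name and its defaults are hypothetical and not lifted from the Nagios source.

```python
from collections import deque
from typing import Deque


class FlapDetector:
    """Simplified flap detection with hysteresis.

    Loosely modeled on the percent-state-change approach in the Nagios
    documentation; the window size and linear weights are assumptions.
    """

    def __init__(self, low: float = 20.0, high: float = 30.0, window: int = 21):
        if not 0.0 <= low < high <= 100.0:
            raise ValueError("need 0 <= low < high <= 100")
        if window < 2:
            raise ValueError("need at least two results to see a transition")
        self.low = low
        self.high = high
        self.history: Deque[int] = deque(maxlen=window)
        self.flapping = False

    def percent_state_change(self) -> float:
        """Weighted share of transitions in the window that changed state."""
        states = list(self.history)
        n = len(states) - 1  # number of transitions in the window
        if n < 1:
            return 0.0
        weighted = total = 0.0
        for i in range(n):
            # Oldest transition weighs 0.8, newest 1.2 (assumed weighting).
            weight = 0.8 + 0.4 * i / (n - 1) if n > 1 else 1.0
            total += weight
            if states[i] != states[i + 1]:
                weighted += weight
        return 100.0 * weighted / total

    def record(self, state: int) -> bool:
        """Add a check result (0-3) and return whether the resource is flapping."""
        self.history.append(state)
        pct = self.percent_state_change()
        # Hysteresis: start above `high`, stop only once below `low`.
        if not self.flapping and pct > self.high:
            self.flapping = True
        elif self.flapping and pct < self.low:
            self.flapping = False
        return self.flapping
```

A notification routine could consult `detector.flapping` and hold back ordinary problem alerts while it is True, which is the behavior the review describes.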

The last major chapter to mention here deals with essentially anything and everything about the Nagios Web interface. The main point of interaction between the administrator and Nagios is the fully featured Web interface. This chapter covers recognizing and working on problems, planning downtimes, making configuration changes, and more. I especially like that the book gives an overview of each of the individual CGI programs that make up the Web interface, as these files are important for UI customization.

The only aspect of this book that I did not care for was that it reads like a reference manual at times. The first several chapters are more conversational in tone, with great explanations of the procedures and files, but later it sometimes feels like I am repeatedly reading the same piece-by-piece structure, filled in with the content for that chapter. That is not necessarily bad altogether, as it does provide consistency in the presentation of the information. Additionally, the level of detail is outstanding throughout the book. The explanations are never too short or too long. This is definitely a valuable book for administrators at all levels, with fantastic breadth and depth of material. Administrators who are interested in proactive management of their systems and networks should be pleased with Nagios: System and Network Monitoring.

Nagios is licensed under the GNU General Public License Version 2, and can be downloaded from http://nagios.org.

David Martinjak is a programmer, GNU/Linux addict, and the director of 2600 in Cincinnati, Ohio. He can be reached at david.martinjak@gmail.com.

You can purchase Nagios: System and Network Monitoring from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.


Old NetSaint and Nagios geek comments (Score:5, Informative)

by Anonymous Coward writes: on Wednesday April 11, 2007 @03:50PM (#18693893)

Please forgive my anonymous coward use: my comments would reveal my name too well.

I'm an *OLD* Netsaint and Nagios user, and have contributed to both. Guides are great, playing with it is great, and it does a lot of things very well. But what Nagios has never had is a way to publish the URLs of specific queries or reports so that they can be bookmarked and sent to someone else for reference. It's a big, big, big flaw in the system, common to a lot of web-based projects.

The other huge, huge flaw of Nagios is configuring it. It shouldn't take a reference book from O'Reilly to do this efficiently, but I'm afraid it does. There are easily a dozen different configuration tools at www.nagiosexchange.org and sourceforge.net, and *every single one of them* has major problems that could be solved with 10% of the time spent on Nagios itself. Most are abandonware, exciting but unfinished projects that will never be completed. Others rely on hand-compiling Nagios itself with strange local modifications and local configurations, which make it very difficult to import a working Nagios setup into them or export one from them. Others have absolutely *no* security model, offering no way to secure access to them or relying on locally stored plain-text password setups; still others rely on non-privileged accounts to edit the Nagios configurations, including the password files for databases or proxy services, in semi-public repositories. Others install every file in a browseable web directory, allowing unauthorized local users to poke at the guts and exploit the security flaws. (Yes, you Perl idiots who create random files and directories without first checking whether they're empty, or protecting them from being written into by other people before you copy their contents, I mean you!)
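The safe pattern the commenter is asking for is to create a fresh directory that nobody else can write into before putting generated files there, and to refuse to open a file that already exists. Here is a minimal Python sketch of that pattern, not drawn from any of the tools mentioned; `write_private_config` and the `nagios-cfg-` prefix are hypothetical. `tempfile.mkdtemp` creates a uniquely named directory accessible only to the creating user, and `O_EXCL` makes `os.open` fail instead of reusing a file or symlink that someone planted.

```python
import os
import tempfile


def write_private_config(contents: str, filename: str = "generated.cfg") -> str:
    """Write generated config into a fresh directory only this user can access.

    tempfile.mkdtemp creates a new, uniquely named directory with mode 0o700
    (subject to the umask) without a creation race, so another local user
    cannot pre-create it or drop files into it.
    """
    workdir = tempfile.mkdtemp(prefix="nagios-cfg-")
    path = os.path.join(workdir, filename)
    # O_CREAT | O_EXCL: fail if the path already exists, including a symlink.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(contents)
    return path
```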

Other configuration tools have beautiful "artist conception" interfaces that will make your eyes bleed after 20 minutes of working with them. Every last one of the tools listed at Sourceforge and NagiosExchange suffers from one or more of the major open source GUI flaws Eric Raymond ranted about in his CUPS horror story, years ago.

It's unfortunately so bad that I've had to throw away weeks of work and switch to Altiris on a major project, which is fairly painful to switch to but at *LEAST* has a usable interface.

Others (Score:4, Informative)

by Colin Smith ( 2679 ) writes: on Wednesday April 11, 2007 @03:51PM (#18693903)

zabbix
jffnms
opennms

etc.

I found nagios rather clunky compared to some of the others.

Re:Old NetSaint and Nagios geek comments (Score:3, Informative)

by walt-sjc ( 145127 ) writes: on Wednesday April 11, 2007 @04:08PM (#18694095)

I've been using nagios for nearly 2 years too, to monitor about 80 servers. Also running the NRPE plugins to monitor things like disk space, load, and a number of other aspects.

I agree that the configuration is pretty bad, and I agree with your other points about the interface. Dependencies are a nightmare to configure.

That said, it does work, and requires very little maintenance once it's set up. It helps to use one file per server, too, since you can include entire directories that contain configuration files. What I did was write a simple Perl script in which I "check off" which services I want to monitor, and it creates the nrpe.conf and Nagios config file for each specific machine. I frequently have to hand-tweak the output for the dependencies, though.
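A generator like the one described might look something like the following Python sketch. It is a guess at the approach, not the commenter's Perl script: the check catalogue, plugin path, and thresholds are assumptions, the `command[name]=...` lines follow the syntax of NRPE's sample configuration file (nrpe.cfg), and the service definitions assume that the Nagios server already defines a `check_nrpe` command plus `generic-host` and `generic-service` templates. As the commenter notes, dependencies would still need hand-tweaking.

```python
from typing import Dict, List, Tuple

# Hypothetical catalogue of checks that can be "ticked off" per host. The
# plugin path and thresholds are illustrative; adjust them to your install.
PLUGINS = "/usr/local/nagios/libexec"
CATALOGUE: Dict[str, str] = {
    "check_disk": f"{PLUGINS}/check_disk -w 20% -c 10% -p /",
    "check_load": f"{PLUGINS}/check_load -w 5,4,3 -c 10,8,6",
    "check_users": f"{PLUGINS}/check_users -w 5 -c 10",
}


def generate(host: str, address: str, selected: List[str]) -> Tuple[str, str]:
    """Return (nrpe_cfg, nagios_cfg) text for one monitored machine."""
    unknown = [name for name in selected if name not in CATALOGUE]
    if unknown:
        raise KeyError(f"not in catalogue: {unknown}")

    # Agent side: NRPE runs the named command when the server asks for it.
    nrpe = "".join(f"command[{name}]={CATALOGUE[name]}\n" for name in selected)

    # Server side: one host plus one service per selected check. Assumes a
    # "check_nrpe" command that passes $ARG1$ to check_nrpe's -c option.
    blocks = [
        f"define host {{\n"
        f"    use                  generic-host\n"
        f"    host_name            {host}\n"
        f"    address              {address}\n"
        f"}}\n"
    ]
    for name in selected:
        label = name.replace("check_", "", 1).upper()
        blocks.append(
            f"define service {{\n"
            f"    use                  generic-service\n"
            f"    host_name            {host}\n"
            f"    service_description  {label}\n"
            f"    check_command        check_nrpe!{name}\n"
            f"}}\n"
        )
    return nrpe, "\n".join(blocks)
```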

I never read any book on it, just the base docs. A book would have helped. I also haven't found any good open source alternative however.

From 0 to Monitoring and Alerting in 30 minutes (Score:2, Informative)

by Jick ( 29139 ) writes: on Wednesday April 11, 2007 @05:21PM (#18695063) Homepage

I'm surprised people still use these 'svn co && ./configure && make install && edit config files' systems. You can download Hyperic HQ, install it, and be monitoring your software and hardware in 30 minutes -- no joke. Want alerts when your disks are full? Cake. Want to autodiscover your Apache server? Cake. Want an alert when a process goes haywire? Cake.

And since it has a pluggable framework, you can monitor anything that you want -- network devices, software, hardware, etc.

It's Open Source and has an active community, so if you really long for the days of 'svn co', that's also provided.

Disclaimer: I work for Hyperic ... and it's objectively better.

Re:Old NetSaint and Nagios geek comments (Score:2, Informative)

by Anonymous Coward writes: on Wednesday April 11, 2007 @05:52PM (#18695445)

I've got to agree. We use it at an ISP level to monitor various functions, both leased-line and server functions, based on customised scripts; easily several thousand devices are monitored primarily through Nagios. The idea is that we can contact customers proactively when they experience connectivity issues, as a free function of the business. As a natural side effect of having acquired other ISPs over the years, our monitoring system is multi-faceted, depending on each ISP's platform quirks. Great for them, a PITA for us, since we now have to monitor multiple copies of Nagios/Netsaint. Whilst the situation is none of the Nagios developers' fault, as people who routinely use these tools we can see ease-of-use issues that could do with some improvement. In an attempt to consolidate all this, one of our engineers, using a mix of open source code and some of his own, produced a DB-based back end for Nagios that generates new config files for Nagios from a cron job.
The platform has been running for a while and stuff is getting transferred across to it, but we keep hitting occasional, almost inexplicable quirks with Nagios, such as bizarre limits on the number of characters that can be used in a string of host names, which required editing the source code and recompiling Nagios! There is no obvious reason for the limit to be as low as it was, and plenty of people seem to be butting their heads against it too. One of the most useful changes the engineer made was a link back from Nagios to the database, so when an alarm occurs it's a two-click process to view the specifics of the alarm and then the details in the database relating to that device. Unfortunately, the resulting hybrid system is too customised to be released as open source.
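The commenter's in-house back end isn't public, so the following is only a hypothetical sketch of the general design: the database is the single source of truth, and a cron job regenerates the config file from it. The SQLite schema, table, and column names are invented for illustration. The file is written to a temporary name and moved into place with `os.replace`, so Nagios never sees a half-written file, and it is only rewritten when its contents actually change. A real job would then check the result with Nagios's own verification mode (`nagios -v` with the main config file) before reloading.

```python
import os
import sqlite3
import tempfile

# Hypothetical schema: one row per device; inactive rows are left out of the config.
SCHEMA = """
CREATE TABLE IF NOT EXISTS hosts (
    name    TEXT PRIMARY KEY,
    address TEXT NOT NULL,
    active  INTEGER NOT NULL DEFAULT 1
)
"""


def render_hosts(conn: sqlite3.Connection) -> str:
    """Build Nagios host definitions for every active device, sorted by name."""
    rows = conn.execute(
        "SELECT name, address FROM hosts WHERE active = 1 ORDER BY name"
    ).fetchall()
    blocks = [
        f"define host {{\n    use        generic-host\n"
        f"    host_name  {name}\n    address    {address}\n}}"
        for name, address in rows
    ]
    return "\n\n".join(blocks) + "\n"


def write_if_changed(path: str, text: str) -> bool:
    """Atomically replace `path` with `text`; return True if it changed."""
    try:
        with open(path, encoding="utf-8") as fh:
            if fh.read() == text:
                return False
    except FileNotFoundError:
        pass
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
        os.replace(tmp, path)  # readers see either the old or the new file
    except BaseException:
        os.unlink(tmp)
        raise
    return True
```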

The whole interface looks, to be frank, ugly and extremely dated. Tactical overview is still too crowded a display.
The lack of a simple, quick visual history is detrimental to trend analysis. On one of our paid-for (licensed per machine) monitoring platforms, it takes barely a minute to view anything from a 24-hour view through to year-long views for the devices, which is great for spotting quirks like customers always turning off their routers at the end of play before a bank holiday weekend, or a server problem that always occurs at the same time each day. Instead, to do trend analysis you are forced to work your way through a not-at-all helpful textual history. Cacti, a free rrdtool-based SNMP monitoring platform, produces graphs quite happily, so it's not as if graphing is unheard of in the open source community either.
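Even without graphs, a crude trend view can be pulled out of alert history by bucketing non-OK alerts by hour of day, which is enough to spot a server that misbehaves at the same time every day. The Python sketch below assumes the alert history has already been parsed into `(unix_timestamp, state)` pairs and reports hours in UTC; that parsing step, the function names, and the text histogram are all hypothetical and not part of Nagios.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, Tuple


def alerts_by_hour(events: Iterable[Tuple[int, str]]) -> Counter:
    """Count non-OK alerts per UTC hour of day (0-23)."""
    counts: Counter = Counter()
    for ts, state in events:
        if state != "OK":
            counts[datetime.fromtimestamp(ts, tz=timezone.utc).hour] += 1
    return counts


def text_histogram(counts: Counter, width: int = 40) -> str:
    """Render one line per hour, with bars scaled to the busiest hour."""
    peak = max(counts.values(), default=0)
    lines = []
    for hour in range(24):
        n = counts.get(hour, 0)
        bar = "#" * (round(width * n / peak) if peak else 0)
        lines.append(f"{hour:02d}:00 {n:4d} {bar}")
    return "\n".join(lines)
```

A row that stands out in every week's histogram is the kind of recurring quirk the commenter describes; a proper graphing tool would show the same thing with less effort.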

Most modern monitoring systems use simple interfaces for managing devices. Nagios is stuck with what can be annoying files to edit. Want to add a device? You've got to do it by hand, making sure to add it in all the locations it needs to be added in. Same for removing: you've got to find every instance of that device in a text file or you're stuffed. Rule of thumb: back up the text file before editing it. The verify tool is helpful, but on a modern system it is still beyond what any user should be expected to handle, and it instantly raises the technical ability required of the operator or maintainer. In my opinion, monitoring should be a no-brainer process: see alarm, inform relevant party (customer, network team, server maintainers). Adding and removing devices should also not be a difficult task. A name, an IP, and a few tick boxes for the choice of monitoring functions should be all that is necessary for day-to-day work. Requiring someone to edit a text file is utterly ridiculous in this day and age.
If the Nagios project wishes to remain as useful as it has been in the past and continue to be used by companies, then routine comparisons with other projects are essential. Stubbornness and refusal to change the core methodology of the platform should be overcome if that results in positive and ongoing improvements, even if general feature progress has to take a back seat. It seems to be a common issue that with many programs the incentive is more to add features than to deal with fundamental operational processes. I guess once you've written code to do something once, it must be quite boring to rewrite it.