jan2004.tar

Reindexing MRTG: An Ops Control Panel

Ben Stern

When monitoring a large network of systems, two utilities that come to mind are Tobi Oetiker's Multi-Router Traffic Grapher (MRTG) and his RRDTool system. Both of these tools can store statistics in a series of data files. However, when it comes to actually displaying the data, there are far fewer display front-ends for MRTG than for RRDTool. In this article, I will discuss a new front-end for MRTG data files that allows a broader understanding of the state of the network than most previous tools.

How MRTG Works

MRTG consists of two major components, both of which are launched by MRTG's main program. The first program collects and stores SNMP data, using Perl's Net::SNMP, and the second collates stored data and generates graphs, using a custom C program (for speed reasons) and Thomas Boutell's GD graphics library. Of course, to view MRTG's output, a Web server must have access to the generated files.

Usually, to monitor a system with MRTG, the system in question needs to run an SNMP agent. Most networking infrastructure includes an SNMP agent in the host software, although it must be configured to allow the monitoring system access to the information. SNMP agents for both Unix and Windows-based systems are available, often as a part of the pre-loaded software. However, if SNMP is not possible, for whatever reason, MRTG can also run commands directly on its host system, which could collect local statistics or use an out-of-band method to acquire statistics from other systems.
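As a rough illustration of the "run a command" approach, the sketch below prints values in the four-line form MRTG reads from an external command: first value, second value, an uptime string, and a target name. Reporting the one- and five-minute load averages, and scaling them by 100 so they can be stored as integers, are assumptions made for this example; the function names are hypothetical.

```python
import os
import socket


def scaled_load(load):
    """Scale a load average by 100 so it can be stored as an integer.

    The factor of 100 is an assumption made for this sketch.
    """
    return int(round(load * 100))


def mrtg_lines(first, second, uptime, name):
    """Build the four lines MRTG reads from an external command."""
    return [str(int(first)), str(int(second)), uptime, name]


if __name__ == "__main__":
    try:
        one_minute, five_minute, _ = os.getloadavg()
    except (AttributeError, OSError):
        # Load averages are not available on every platform.
        one_minute = five_minute = 0.0
    lines = mrtg_lines(scaled_load(one_minute), scaled_load(five_minute),
                       "unknown", socket.gethostname())
    print("\n".join(lines))
```

A script like this would be named between backticks in the Target line of the MRTG configuration file, in place of the usual SNMP target.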

Once MRTG has collected data points, the data can be stored in a variety of ways. (For example, the data files can be post-processed to insert them into a database.) At the time of this writing, MRTG has only two native data file formats: the traditional MRTG flat-text format, and the new RRDTool binary format. If the RRDTool format is used, or if the MRTG data is post-processed by an independent program, MRTG must be used solely as a data collector and some other method must be used to generate graphs.

This article will focus upon the traditional file format, because a plethora of RRDTool front-ends exist for displaying collected data in the RRDTool format. Note that MRTG's data files (both text and binary) are slightly lossy: data is interpolated, especially older data points, and the numbers stored are best used to see baselines, rather than precise traffic amounts. (The exceptions are the first and second lines in the flat-text representation, which store the most accurate values of the last poll MRTG conducted.)

Although MRTG does its job of storing and graphing well, many sites are making the switch to RRDTool-backed data. Now that MRTG can target RRDTool directly, RRDTool serves as an excellent middle ground between staying with a one-stop solution of MRTG and moving to a whole new data collection platform. Furthermore, because RRDTool was also written by Tobi Oetiker, RRDTool is a trusted collection system, and has some notable next-generation benefits. Reading large amounts of data from it (for graph generation, among other things) is noticeably faster, simply because the RRDTool file format is binary rather than text-based.

Also, RRDTool can support a wider variety of data points, such as floating-point numbers instead of just integers. With RRDTool, performing mathematical transformations on stored data before graphing becomes substantially simpler. Finally, RRDTool explicitly decouples data storage and graph generation from data collection, so there is less precedent for using a well-known set of stock views. As a result, a large number of front-ends for RRDTool files have appeared, which provide a larger choice of data overviews and more customizable graphs.

Notwithstanding all of these reasons to use RRDTool-backed data, many sites still use MRTG's flat-text format. This format is quite easy to parse by automated tools, and many sites using MRTG have a large number of scripts and programs developed in-house to support a comprehensive monitoring platform. Although the formatting of the details MRTG provides is not the blank slate that RRDTool allows, the legends and information included with MRTG-generated graphs can be customized to a great extent via the configuration file.

Displaying Collected Data

If you choose to use MRTG, or you are already using it in a legacy environment, you can build your own data display system or choose from several pre-designed front-ends (see sidebar "MRTG Front-End Options"). A front-end is not strictly required, because MRTG can (and usually does) pre-generate Web pages containing collected data. However, using a summary page that is more visually appealing and informational than a simple directory listing can be beneficial.

All of the existing MRTG front-ends have their advantages, but none of them fully addressed my needs. I developed the Ops Control Panel (OCP) because I needed a solution that offered an easy-to-alter presentation of data. I also needed a readable real-time overview of system performance. Problems had to be immediately visible, and output had to be formatted so that it was easy to compare similar systems without being overwhelmed.

The Ops Control Panel

OCP was originally designed to serve as an MRTG index that can be refreshed throughout the workday, so that users can view a complete overview of performance in almost real-time. Additionally, it had to indicate problems quickly, and provide enough information for operations staff to take troubleshooting steps immediately. Effectively, all of these requirements stem from the principle of "make it as simple as possible, but no simpler." OCP is available for download under the terms of the GNU GPL from http://www.fort-tech.com/software/ocp/.

As OCP has evolved, all of these features have remained at the core of the design. A single page lets the user make a quick assessment of all monitored systems. If a problem is detected, the user can quickly check other variables from that host or that cluster to see if there is a correlation, or drill down directly into the problem to see how long it has been going on.

OCP uses the second line of MRTG's output files, which contains the least interpolated results of the last data poll. As a result, OCP provides nearly up-to-the-minute snapshots of performance, as opposed to the slightly more general graphs that MRTG provides. The explicit clustering of related variables allows "group views" of server farms, related systems, or similar datasets, so that off-the-cuff "apples to apples" comparisons can be made. Also, adding a new set of variables is a simple process, consisting of adding a new group in the configuration file and finding a place for it in the user-created page template.
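To make the idea of reading the second line concrete, here is a minimal sketch. It is not OCP's own code. It assumes the usual layout of the flat-text log: a first line holding a timestamp and two raw counters, followed by lines of five integers (timestamp, average in, average out, maximum in, maximum out). The Sample type and function names are hypothetical.

```python
from collections import namedtuple
from itertools import islice

# One data row of the flat-text log, as assumed in this sketch.
Sample = namedtuple("Sample", "timestamp avg_in avg_out max_in max_out")


def parse_latest(lines):
    """Return the second line of an MRTG log as a Sample.

    `lines` may be an open file or any iterable of text lines; only the
    first two lines are consumed.
    """
    head = list(islice(lines, 2))
    if len(head) < 2:
        raise ValueError("log has fewer than two lines")
    fields = head[1].split()
    if len(fields) != 5:
        raise ValueError("expected five fields on the second line")
    return Sample(*(int(field) for field in fields))


def read_latest(path):
    """Open a .log file and parse its most recent sample."""
    with open(path) as handle:
        return parse_latest(handle)
```

Because only two lines are read, a page that summarizes many variables stays cheap to generate no matter how much history each log holds.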

OCP can be used to present any number of systems. Displays can be broken out into a series of pages, allowing huge data sets to be presented individually or all at once, depending upon the user's preference. When all systems are operating within the expected range, the display will resemble that shown in Figure 1.

The table shown is one of the clusters an OCP-generated Web page could display. The clusters are configured to display a single variable on all monitored systems, as indicated by the table's title. Each cell contains the name of the system being monitored, the last time data was retrieved, and the most recent values of the monitored data. The name is also a link to the MRTG-generated Web page that displays the history of the variable on that system. At the bottom of each cluster is a link to the OCP script that will display all of the most recent MRTG graphs for this cluster on a single page. This allows the user to quickly see a historical comparison of all systems. The display is useful for determining whether overall load has increased, whether the trouble thresholds ought to be readjusted, or whether a single system has simply been out of tolerance for a little while.

An alternate configuration, which clusters variables by system, is shown in Figure 2. This sort of clustering could be useful if related variables are being collected. For example, notice that the timestamps for both "CPU Load" and "RAM in Use" are printed boldly and in red. It is likely that the system is swapping heavily, and the administrator should immediately check for runaway processes.

Typical OCP pages contain several clusters, aiding in comparisons across the board. This not only allows quick comparisons between similar machines, but can also be used to form an overall impression regarding the performance of the entire network.

Dependencies

OCP needs roughly the same software as MRTG. A Web server, such as Apache, with access to the MRTG files is needed, both to use MRTG effectively and to allow the critical data to be read by the OCP for display to the user. OCP itself is either a PHP 4 script or a CGI script. Depending upon which version is used, the interpreter in question must be executable by the Web server's userid.

MRTG must already be configured normally for the host network. This configuration is wholly independent of the OCP configuration, since the OCP does no actual data collection itself, only interpretation. A number of resources exist to help configure new MRTG installations. Normal configuration also includes either adding a cron job to run MRTG regularly, or configuring MRTG to run as a daemon and starting it from an rc script to ensure that data flows into the system.

Configuration of the OCP

Once MRTG is running correctly and monitoring your systems, the OCP must be configured. There are two types of configuration files for the OCP. The first, the global configuration file, specifies the variables being monitored and a grouping for each set of variables. It also names the location of the template files to minimize the impact upon system security.

The grouping used by OCP lets the administrator aggregate data in whatever way is most useful. Because MRTG usually monitors multiple variables per host, a group can collect one variable across many hosts or several related variables from a single host. A portion of the sample configuration file follows. (In this example, MRTG's data files for each system happen to be stored in subdirectories.)

TEMPLATE="ocp.template";
TEMPLATEDIR="/usr/local/ocp";

# Specify the title of a class.
NETLOAD="Network Load";
# Add some systems' traffic, in bits.
NETLOAD("marianne eth0", marianne/marianne_2, bits, 8388608);
NETLOAD("holly hme0", holly/holly_2, bits, 8388608);
# We don't care how much traffic amanda passes.
NETLOAD("amanda eth0", amanda/amanda_2, bits);
NETLOAD("amanda eth1", amanda/amanda_3, bits);
...
CPU="System Load";
CPU("marianne", marianne/marianne_cpu, load);
# If load on holly gets too high, visually differentiate it.
CPU("holly", holly/holly_cpu, load, 100);
...
TEMP="System Temperature";
TEMP("amanda Fan0", amanda/amanda_fan0, "&deg;&nbsp;C", 85);

Other than the keywords TEMPLATE and TEMPLATEDIR, the names of classes can be any word. Normally, the TEMPLATE is set to the filename of the template file to use. However, if the template is set to the reserved value "CGI", the OCP will expect to get the name of the template file in an HTTP GET or POST variable named template. This allows multiple templates to be created that only show selected variables or machines, either by including links terminated with "?template=anothertemplate" or by using an HTML form that sets the template variable. The TEMPLATE may appear anywhere in the file but must not be repeated.

The TEMPLATEDIR reserved keyword specifies the fully qualified path in which to look for OCP templates. This is required to prevent users from specifying, for example, a template of /etc/passwd and collecting information to which they should not have access. This keyword may appear anywhere in the configuration file but must not be repeated.
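The kind of check that TEMPLATEDIR makes possible can be sketched as follows. This is an illustration of the general technique, not the OCP's actual code; the function name resolve_template is hypothetical.

```python
import os


def resolve_template(template_dir, requested):
    """Return the full path of a requested template.

    Raises ValueError if the resolved path would fall outside
    template_dir, for example "/etc/passwd" or "../../etc/passwd".
    """
    base = os.path.realpath(template_dir)
    candidate = os.path.realpath(os.path.join(base, requested))
    # commonpath equals base only when candidate lies at or below base.
    if candidate == base or os.path.commonpath([base, candidate]) != base:
        raise ValueError("template is outside the template directory")
    return candidate
```

Resolving the path before comparing it is the important step: a plain string-prefix test would miss names containing ".." and symbolic links that point elsewhere.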

The rest of the global configuration file contains class definitions, made up of both titles and contents. It may contain comments at either the beginning of a line or after a semicolon. Comments start with a pound sign and go to the end of the line. Data may be split across multiple lines. All data lines must end with a semicolon.

The syntax for class titles is

CLASSNAME="class title";

A class title may appear at any point in the configuration file, as long as it comes before the first data element added to that class. To actually add data to a class, use:

CLASSNAME("name", logs, type[, limit]);

The name is the title of the link to the MRTG-generated Web page for this variable. The second argument is the path to the series of files MRTG creates when it monitors a variable. To get this name, use the same path that MRTG uses, followed by the name of the variable MRTG is monitoring. By following these rules, if ".log" is appended to the given text, the string will be the MRTG logfile name; if ".html" is appended, it will be the Web page MRTG generated. The type can be either a keyword defining the type of data, or freeform text, in double quotes, which will be displayed after the variable's most recent value. Finally, if the limit is provided, this number will be the "warning threshold," and any values above this number will cause the timestamp that the OCP prints to appear in a bolded red font. Any entries that appear in quotes in the configuration file may contain HTML (except for the reserved keywords, which must match local filesystem requirements).
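The rules above are enough to sketch a small reader for this file format. The code below is an independent illustration, not the parser that ships with the OCP. It treats any pound sign outside double quotes as the start of a comment, which is slightly more permissive than the rules described, and it leaves quoted text alone because quoted values may contain HTML entities with semicolons or pound signs. All function and key names are hypothetical.

```python
import re

TITLE_RE = re.compile(r'^(\w+)\s*=\s*"(.*)"$', re.S)
ENTRY_RE = re.compile(r'^(\w+)\s*\((.*)\)$', re.S)
RESERVED = ("TEMPLATE", "TEMPLATEDIR")


def split_statements(text):
    """Split on ';' outside quotes, dropping '#' comments outside quotes."""
    statements, current = [], []
    in_quotes = in_comment = False
    for ch in text:
        if in_comment:
            if ch != "\n":
                continue
            in_comment = False  # the newline itself is kept below
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes and ch == "#":
            in_comment = True
            continue
        elif not in_quotes and ch == ";":
            statement = "".join(current).strip()
            if statement:
                statements.append(statement)
            current = []
            continue
        current.append(ch)
    return statements


def split_args(arg_text):
    """Split an argument list on commas that are outside double quotes."""
    args, current, in_quotes = [], [], False
    for ch in arg_text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == "," and not in_quotes:
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    args.append("".join(current).strip())
    return args


def parse_config(text):
    """Return (settings, classes) parsed from configuration text."""
    settings, classes = {}, {}
    for statement in split_statements(text):
        title = TITLE_RE.match(statement)
        entry = ENTRY_RE.match(statement)
        if title and title.group(1) in RESERVED:
            settings[title.group(1)] = title.group(2)
        elif title:
            classes[title.group(1)] = {"title": title.group(2), "entries": []}
        elif entry:
            args = split_args(entry.group(2))
            if not 3 <= len(args) <= 4:
                raise ValueError("bad entry: " + statement)
            name, logs, kind = args[:3]
            # A KeyError here means the class title was not given first.
            classes[entry.group(1)]["entries"].append({
                "name": name.strip('"'),
                "logs": logs,
                "type": kind.strip('"'),
                "freeform": kind.startswith('"'),
                "limit": float(args[3]) if len(args) == 4 else None,
            })
        else:
            raise ValueError("unrecognized statement: " + statement)
    return settings, classes
```

With the parsed entry in hand, appending ".log" or ".html" to the "logs" value gives the MRTG logfile and Web page names described above.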

Predefined data types include bits, kilobytes, percent, and load. Each of them alters the data output somewhat. Supplying "bits" causes data to be displayed as a number followed by b/s, Kb/s, or Mb/s. Similarly, "kilobytes" will provide the units of KB or MB, depending upon the size of the displayed value, but the second value is omitted from display. This is because variables measured in kilobytes are often used as a measurement of memory, and there is generally only one useful data point for such measurements. Using "load" will cause both collected values to be displayed as normal Unix load averages, while "percent" displays only the first number, as a percentage value. (Again, most data measured as a percentage do not come in pairs.) If a type is provided in quotes, that text will appear after the first data point, and the second will be omitted from display.
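The display rules for each type can be sketched as a single formatting function. Several details here are assumptions made for illustration and are not taken from the OCP itself: units step up at multiples of 1024, load values are stored multiplied by 100, values are shown to one decimal place, and paired values are separated by a slash.

```python
BIT_UNITS = ("b/s", "Kb/s", "Mb/s")
KILOBYTE_UNITS = ("KB", "MB")


def scale(value, units, step=1024):
    """Pick the largest unit that keeps the number below `step`."""
    for unit in units[:-1]:
        if value < step:
            return f"{value:.1f} {unit}"
        value /= step
    return f"{value:.1f} {units[-1]}"


def format_reading(first, second, kind, freeform=False):
    """Format a pair of collected values according to the data type."""
    if freeform:
        # Quoted types: the text follows the first value only.
        return f"{first} {kind}"
    if kind == "bits":
        return f"{scale(first, BIT_UNITS)} / {scale(second, BIT_UNITS)}"
    if kind == "kilobytes":
        return scale(first, KILOBYTE_UNITS)
    if kind == "load":
        # Assumes load averages were stored multiplied by 100.
        return f"{first / 100:.2f} / {second / 100:.2f}"
    if kind == "percent":
        return f"{first}%"
    raise ValueError("unknown data type: " + kind)
```

For example, format_reading(150, 75, "load") yields "1.50 / 0.75", while format_reading(85, 0, "&deg;&nbsp;C", freeform=True) yields the first value followed by the quoted text.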

Once the configuration file has been filled in, it's time to edit the other type of file that the OCP uses: a template. Templatizing the output increases the flexibility of the OCP's output appearance. These files are written in normal HTML, except that they may have <ocp> tags in them. The syntax of this tag is <ocp CLASSNAME=numcols>. That is, to insert a table with "NETLOAD" data into an OCP Web page, one would use <ocp NETLOAD=2> (for two-column output). Data is filled in with an alternating light background in each cell, to increase visual differentiation without being overly distracting. Extra cells (if the number of elements in a class is not a multiple of the number of columns in the table) are left blank.
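Template expansion of this kind can be sketched in a few lines. The HTML produced here, including the colors and the layout of each cell, is an assumption made for illustration and does not reproduce the OCP's real output; the function names are hypothetical.

```python
import re

OCP_TAG = re.compile(r"<ocp\s+(\w+)=(\d+)>", re.I)
SHADES = ("#ffffff", "#eeeeee")  # alternating backgrounds (assumed colors)


def render_cell(name, logs, stamp, reading, over_limit=False):
    """One cell: link to MRTG's page, then the timestamp and the values."""
    if over_limit:
        stamp = f'<b><font color="red">{stamp}</font></b>'
    return f'<a href="{logs}.html">{name}</a><br>{stamp}<br>{reading}'


def render_table(title, cells, columns):
    """Lay out cells in `columns` columns, leaving leftover cells blank."""
    if columns < 1:
        raise ValueError("column count must be at least 1")
    rows = [f'<tr><th colspan="{columns}">{title}</th></tr>']
    for row_index, start in enumerate(range(0, len(cells), columns)):
        row = cells[start:start + columns]
        tds = []
        for col_index in range(columns):
            if col_index < len(row):
                shade = SHADES[(row_index + col_index) % 2]
                tds.append(f'<td bgcolor="{shade}">{row[col_index]}</td>')
            else:
                tds.append("<td></td>")
        rows.append("<tr>" + "".join(tds) + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


def expand_template(template, tables):
    """Replace each <ocp CLASSNAME=numcols> tag with a rendered table.

    `tables` maps a class name to a (title, list of cell HTML) pair.
    """
    def replace(match):
        title, cells = tables[match.group(1)]
        return render_table(title, cells, int(match.group(2)))
    return OCP_TAG.sub(replace, template)
```

Everything outside the tags passes through untouched, so the rest of the page can be laid out with ordinary HTML.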

Finishing Up

All that remains is to add the OCP PHP script or Perl script to the Web server. This is as simple as adding links to it from an existing page, or internally publishing the link or links, depending upon how your site normally handles monitoring. If you need to protect the data, normal access control mechanisms should be applied, which might include multiple templates and CGI to ensure that various users can only see the templates intended for them.

Useful Links

MRTG -- http://ee-staff.ethz.ch/~oetiker/webtools/mrtg/

GD -- http://www.boutell.com/gd/

14all -- http://my14all.sourceforge.net/

MRTG Viewer -- http://noc.asti.dost.gov.ph/docus/tools/mrtg/mrtgviewer.php

OCP -- http://www.fort-tech.com/software/ocp/

Ben Stern has been a programmer and systems administrator for the past seven years, with a focus on highly available DNS architecture. He has deployed these tools on nationwide networks, and used them operationally to provide early detection of unusual usage patterns and network instability. For the past two years, he has been writing commercial-grade free software for use in high-performance environments.