Reindexing MRTG: An Ops Control Panel
Ben Stern
When monitoring a large network of systems, two utilities that
come to mind are Tobi Oetiker's Multi-Router Traffic Grapher (MRTG)
and his RRDTool system. Both of these tools can store statistics
in a series of data files. However, when it comes to actually displaying
the data, there are far fewer display front-ends for MRTG than for
RRDTool. In this article, I will discuss a new front-end for MRTG
data files that allows a broader understanding of the state of the
network than most previous tools.
How MRTG Works
MRTG consists of two major components, both of which are launched
by MRTG's main program. The first program collects and stores SNMP
data, using Perl's Net::SNMP, and the second collates stored data
and generates graphs, using a custom C program (for speed reasons)
and Thomas Boutell's GD graphics library. Of course, to view MRTG's
output, a Web server must have access to the generated files.
Usually, to monitor a system with MRTG, the system in question
needs to run an SNMP agent. Most networking infrastructure includes
an SNMP agent in the host software, although it must be configured
to allow the monitoring system access to the information. SNMP agents
for both Unix and Windows-based systems are available, often as
a part of the pre-loaded software. However, if SNMP is not possible,
for whatever reason, MRTG can also run commands directly on its
host system, which could collect local statistics or use an out-of-band
method to acquire statistics from other systems.
Once MRTG has collected data points, the data can be stored in
a variety of ways. (For example, the data files can be post-processed,
to insert them into a database.) At the time of this writing, MRTG
has only two native formats for data files in which it can store
data: the traditional MRTG flat-text format, and the new RRDTool
binary format. If the RRDTool format is used, or if the MRTG data
is post-processed by an independent program, MRTG serves solely
as a data collector, and some other method must be used to generate
graphs.
This article will focus upon the traditional file format, because
a plethora of RRDTool front-ends exist for displaying collected
data in the RRDTool format. Note that MRTG's data files (both text
and binary) are slightly lossy: data is interpolated, especially
older data points, and the numbers stored are best used to see baselines,
rather than precise traffic amounts. (The exceptions are the first
and second lines in the flat-text representation, which store the
most accurate values of the last poll MRTG conducted.)
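The role of those first two lines can be illustrated with a short
Python sketch. It assumes the flat-text layout described in MRTG's
log file documentation: line one holds a timestamp and the two raw
counters, and each later line holds a timestamp followed by average
and maximum values for the two variables. The names LastPoll and
read_last_poll are hypothetical and are not part of MRTG or OCP.

```python
from typing import Iterable, NamedTuple


class LastPoll(NamedTuple):
    timestamp: int    # Unix time of the most recent poll (line 1)
    in_counter: int   # raw counter for the first variable (line 1)
    out_counter: int  # raw counter for the second variable (line 1)
    in_rate: int      # most recent average, first variable (line 2)
    out_rate: int     # most recent average, second variable (line 2)


def read_last_poll(lines: Iterable[str]) -> LastPoll:
    """Parse only the first two lines of an MRTG flat-text log."""
    it = iter(lines)
    try:
        head = next(it).split()
        second = next(it).split()
    except StopIteration:
        raise ValueError("log has fewer than two lines") from None
    if len(head) < 3 or len(second) < 3:
        raise ValueError("not an MRTG flat-text log")
    timestamp, in_counter, out_counter = (int(v) for v in head[:3])
    return LastPoll(timestamp, in_counter, out_counter,
                    int(second[1]), int(second[2]))


# Made-up sample data in the assumed layout.
sample = [
    "1054000200 123456789 987654321",
    "1054000200 1500 2300 1500 2300",
    "1054000000 1400 2200 1400 2200",
]
poll = read_last_poll(sample)
print(poll.in_rate, poll.out_rate)
```

In practice the same function could be handed an open file object,
since it reads no further than the second line.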
Although MRTG does its job of storing and graphing well, many
sites are making the switch to RRDTool-backed data. Now that MRTG
can target RRDTool directly, RRDTool serves as an excellent middle
ground between staying with a one-stop solution of MRTG and moving
to a whole new data collection platform. Furthermore, because RRDTool
was also written by Tobi Oetiker, RRDTool is a trusted collection
system, and has some notable next-generation benefits. Reading large
amounts of data from it (for graph generation, among other things)
is noticeably faster, simply because the RRDTool file format is
binary rather than text-based.
Also, RRDTool can support a wider variety of data points, such
as floating-point numbers instead of just integers. With RRDTool,
performing mathematical transformations on stored data before graphing
becomes substantially simpler. Finally, RRDTool explicitly decouples
data storage and graph generation from data collection, so there
is less precedent for using a well-known set of stock views. As
a result, a large number of front-ends for RRDTool files have appeared,
which provide a larger choice of data overviews and more customizable
graphs.
Notwithstanding all of these reasons to use RRDTool-backed data,
many sites still use MRTG's flat-text format. This format is quite
easy to parse by automated tools, and many sites using MRTG have
a large number of scripts and programs developed in-house to support
a comprehensive monitoring platform. Although the formatting of
the details MRTG provides is not the blank slate that RRDTool allows,
the legends and information included with MRTG-generated graphs
can be customized to a great extent via the configuration file.
Displaying Collected Data
If you choose to use MRTG, or you are already using it in a legacy
environment, you can build your own data display system or choose
from several pre-designed front-ends (see sidebar "MRTG Front-End
Options"). A front-end is not strictly required, because MRTG can
(and usually does) pre-generate Web pages containing collected data.
However, a summary page that is more visually appealing and
informative than a simple directory listing can be beneficial.
All of the existing MRTG front-ends have their advantages, but
none of them fully addressed my needs. I developed the Ops Control
Panel (OCP) because I needed a solution that offered an easy-to-alter
presentation of data. I also needed a readable real-time overview
of system performance. Problems had to be immediately visible, and
output had to be formatted so that similar systems were easy to
compare without overwhelming the reader.
The Ops Control Panel
OCP was originally designed to serve as an MRTG index that can
be refreshed throughout the workday, so that users can view a complete
overview of performance in almost real-time. Additionally, it had
to indicate problems quickly, and provide enough information for
operations staff to take troubleshooting steps immediately. Effectively,
all of these requirements stem from the principle of "make it as
simple as possible, but no simpler." OCP is available for download
under the terms of the GNU GPL from http://www.fort-tech.com/software/ocp/.
As OCP has evolved, all of these features have remained at the
core of the design. A single page lets the user make a quick assessment
of all monitored systems. If a problem is detected, the user can
quickly check other variables from that host or that cluster to
see if there is a correlation, or drill down directly into the problem
to see how long it has been going on.
OCP uses the second line of each of MRTG's log files, which contains
the least interpolated results of the last data poll. As a result,
OCP provides nearly up-to-the-minute snapshots of performance, as
opposed to the slightly more general graphs that MRTG provides.
The explicit clustering of related variables allows "group views"
of server farms, related systems, or similar datasets, so that off-the-cuff
"apples to apples" comparisons can be made. Also, adding a new set
of variables is a simple process, consisting of adding a new group
in the configuration file and finding a place for them in the user-created
page template.
OCP can be used to present any number of systems. Displays can
be broken out into a series of pages, allowing huge data sets to
be presented individually or all at once, depending upon the user's
preference. When all systems are operating within the expected range,
the display will resemble that shown in Figure 1.
The table shown is one of the clusters an OCP-generated Web page
could display. The clusters are configured to display a single variable
on all monitored systems, as indicated by the table's title. Each
cell contains the name of the system being monitored, the last time
data was retrieved, and the most recent values of the monitored
data. The name is also a link to the MRTG-generated Web page that
displays the history of the variable on that system. At the bottom
of each cluster is a link to the OCP script that will display all
of the most recent MRTG graphs for this cluster on a single page.
This allows the user to quickly see a historical comparison of all
systems. The display is useful in determining whether overall load
has increased, whether the trouble thresholds ought to be readjusted,
or whether a single system has simply been out of tolerance for
a little while.
An alternate configuration, which clusters variables by system,
is shown in Figure 2. This sort of clustering could be useful if
related variables are being collected. For example, notice that
the timestamps for both "CPU Load" and "RAM in Use" are printed
in bold red. It is likely that the system is swapping heavily,
and the administrator should immediately check for runaway processes.
Typical OCP pages contain several clusters, aiding in comparisons
across the board. This not only allows quick comparisons between
similar machines, but can also be used to form an overall impression
regarding the performance of the entire network.
Dependencies
OCP needs roughly the same software as MRTG. A Web server, such
as Apache, with access to the MRTG files is needed, both to use
MRTG effectively and to allow the critical data to be read by the
OCP for display to the user. OCP itself is either a PHP 4 script
or a CGI script. Depending upon which version is used, the interpreter
in question must be executable by the Web server's userid.
MRTG must already be configured normally for the host network.
This configuration is wholly independent of the OCP configuration,
since the OCP does no actual data collection itself, only interpretation.
A number of resources exist to help configure new MRTG installations.
Normal configuration also includes either adding a cron job to run
MRTG regularly, or configuring MRTG to run as a daemon and starting
it from an rc script to ensure that data flows into the system.
Configuration of the OCP
Once MRTG is running correctly and monitoring your systems, the
OCP must be configured. There are two types of configuration files
for the OCP. The first, the global configuration file, specifies
the variables being monitored and a grouping for each set of variables.
It also names the location of the template files to minimize the
impact upon system security.
The grouping used by OCP allows related variables to be displayed
together. Because MRTG usually monitors multiple variables per
host, the OCP's groups can aggregate data however the
administrator chooses. A portion of the sample
configuration file follows. (In this example, MRTG's data files
for each system happen to be stored in subdirectories.)
TEMPLATE="ocp.template";
TEMPLATEDIR="/usr/local/ocp";
# Specify the title of a class.
NETLOAD="Network Load";
# Add some systems; traffic is measured in bits.
NETLOAD("marianne eth0", marianne/marianne_2, bits, 8388608);
NETLOAD("holly hme0", holly/holly_2, bits, 8388608);
# We don't care how much traffic amanda passes.
NETLOAD("amanda eth0", amanda/amanda_2, bits);
NETLOAD("amanda eth1", amanda/amanda_3, bits);
...
CPU="System Load";
CPU("marianne", marianne/marianne_cpu, load);
# If load on holly gets too high, visually differentiate it.
CPU("holly", holly/holly_cpu, load, 100);
...
TEMP="System Temperature";
TEMP("amanda Fan0", amanda/amanda_fan0, "° C", 85);
Other than the reserved keywords TEMPLATE and TEMPLATEDIR, a class
name can be any word. Normally, TEMPLATE is set to the filename of
the template file to use. However, if TEMPLATE is set to the reserved
value "CGI", the OCP will expect to get the name of the template file
in an HTTP GET or POST variable named template. This allows
multiple templates to be created that show only selected variables
or machines, either by including links terminated with "?template=anothertemplate"
or by using an HTML form that sets the template variable. The TEMPLATE
keyword may appear anywhere in the file but must not be repeated.
The TEMPLATEDIR reserved keyword specifies the fully qualified
path in which to look for OCP templates. This is required to prevent
users from specifying, for example, a template of /etc/passwd and
collecting information to which they should not have access. This
keyword may appear anywhere in the configuration file but must not
be repeated.
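The kind of check that TEMPLATEDIR makes possible can be sketched
in Python. This is not OCP's actual code (OCP is a PHP or Perl
script); it is a minimal illustration, and the function name
resolve_template is hypothetical. The idea is to resolve the
requested name relative to the template directory and refuse
anything that ends up outside it.

```python
from pathlib import Path


def resolve_template(template_dir: str, requested: str) -> Path:
    """Return the template path, refusing anything outside template_dir."""
    base = Path(template_dir).resolve()
    # Joining with an absolute path discards base, and ".." segments
    # are collapsed by resolve(), so both escapes are caught below.
    candidate = (base / requested).resolve()
    try:
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(
            f"template {requested!r} is outside {base}"
        ) from None
    return candidate


# Hypothetical use with the TEMPLATEDIR value from the sample file.
print(resolve_template("/usr/local/ocp", "ocp.template"))
for bad in ("/etc/passwd", "../../etc/passwd"):
    try:
        resolve_template("/usr/local/ocp", bad)
    except ValueError as err:
        print("rejected:", err)
```

A check of this sort matters most when TEMPLATE is set to "CGI",
because the template name then comes directly from the user's
request.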
The rest of the global configuration file contains class definitions,
made up of both titles and contents. It may contain comments either
at the beginning of a line or after a semicolon. Comments start
with a pound sign (#) and run to the end of the line. Data may be split
across multiple lines. All data lines must end with a semicolon.
The syntax for class titles is
CLASSNAME="class title";
A class title may appear at any point in the configuration file, as
long as it precedes the first data element added to that class. To
actually add data to a class, use:
CLASSNAME("name", logs, type[, limit]);
The name is the title of the link to the MRTG-generated Web page for
this variable. The second argument is the path to the series of files
MRTG creates when it monitors a variable. To get this name, use the
same path that MRTG uses, followed by the name of the variable MRTG
is monitoring. By following these rules, if ".log" is appended to
the given text, the string will be the MRTG logfile name; if ".html"
is appended, it will be the Web page MRTG generated. The type can
be either a keyword defining the type of data, or freeform text, in
double quotes, which will be displayed after the variable's most recent
value. Finally, if the limit is provided, this number will be the
"warning threshold," and any values above this number will cause the
timestamp that the OCP prints to appear in a bolded red font. Any
entries that appear in quotes in the configuration file may contain
HTML (except for the reserved keywords, which must match local filesystem
requirements).
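To make the data-line syntax concrete, here is a minimal Python
sketch of how one such statement could be parsed and how the two
file names and the warning threshold follow from it. It is an
illustration only, not OCP's parser: the Entry class, parse_entry
function, and the regular expression are hypothetical, comments are
not stripped, and the sketch assumes the whole statement is passed
in as one string.

```python
import re
from dataclasses import dataclass
from typing import Optional

# One data statement: CLASSNAME("name", logs, type[, limit]);
ENTRY_RE = re.compile(
    r'''
    ^\s*(?P<cls>\w+)\(\s*
    "(?P<name>[^"]*)"\s*,\s*
    (?P<logs>[^,\s]+)\s*,\s*
    (?P<type>"[^"]*"|\w+)\s*
    (?:,\s*(?P<limit>\d+(?:\.\d+)?)\s*)?
    \)\s*;
    ''',
    re.VERBOSE,
)


@dataclass(frozen=True)
class Entry:
    cls: str
    name: str
    logs: str
    unit: str
    limit: Optional[float]

    @property
    def log_path(self) -> str:
        # Appending ".log" gives the MRTG logfile name.
        return self.logs + ".log"

    @property
    def html_path(self) -> str:
        # Appending ".html" gives the MRTG-generated Web page.
        return self.logs + ".html"

    def over_limit(self, value: float) -> bool:
        # Only values above the optional limit trigger the warning.
        return self.limit is not None and value > self.limit


def parse_entry(statement: str) -> Entry:
    m = ENTRY_RE.match(statement)
    if m is None:
        raise ValueError(f"not a data statement: {statement!r}")
    limit = m.group("limit")
    return Entry(m.group("cls"), m.group("name"), m.group("logs"),
                 m.group("type").strip('"'),
                 float(limit) if limit else None)


entry = parse_entry('NETLOAD("holly hme0", holly/holly_2, bits, 8388608);')
print(entry.log_path, entry.html_path, entry.over_limit(9000000))
```

An entry without a limit, such as the amanda lines in the sample
file, never reports a warning in this sketch.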
Predefined data types include bits, kilobytes, percent, and load.
Each of them alters the data output somewhat. Supplying "bits" causes
data to be displayed as a number followed by b/s, Kb/s, or Mb/s.
Similarly, "kilobytes" will provide the units of KB or MB, depending
upon the size of the displayed value, but the second value is omitted
from display. This is because variables measured in kilobytes are
often used as a measurement of memory, and there is generally only
one useful data point for such measurements. Using "load" will cause
both collected values to be displayed as normal Unix load averages,
while "percent" displays only the first number, as a percentage
value. (Again, most data measured as a percentage do not
come in pairs.) If a type is provided in quotes, that text
will appear after the first data point, and the second will be omitted
from display.
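The display rules for the data types can be summarized in a short
Python sketch. Several details here are assumptions rather than
facts about OCP: the function names, a step of 1000 between b/s,
Kb/s, and Mb/s, a step of 1024 between KB and MB, one decimal place
of precision, and the guess that load values are stored multiplied
by 100 (a common MRTG practice, since the flat-text log holds
integers).

```python
from typing import Sequence


def scale(value: float, units: Sequence[str], step: int) -> str:
    """Pick the largest unit that keeps the number below one step."""
    for unit in units[:-1]:
        if value < step:
            return f"{value:.1f} {unit}"
        value /= step
    return f"{value:.1f} {units[-1]}"


def format_reading(kind: str, first: float, second: float) -> str:
    if kind == "bits":
        # Both values are shown, each with its own unit.
        units = ("b/s", "Kb/s", "Mb/s")
        return f"{scale(first, units, 1000)} / {scale(second, units, 1000)}"
    if kind == "kilobytes":
        # Second value omitted: memory usually has one useful number.
        return scale(first, ("KB", "MB"), 1024)
    if kind == "percent":
        # Second value omitted here as well.
        return f"{first:.0f}%"
    if kind == "load":
        # ASSUMPTION: stored as load average times 100.
        return f"{first / 100:.2f} / {second / 100:.2f}"
    # Freeform text from the configuration file follows the first value.
    return f"{first:g} {kind}"


for kind, a, b in [("bits", 1500, 2300000), ("kilobytes", 2048, 0),
                   ("percent", 42, 0), ("load", 150, 75),
                   ("° C", 61, 0)]:
    print(kind, "->", format_reading(kind, a, b))
```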
Once the configuration file has been filled in, it's time to edit
the other type of file that the OCP uses: a template. Templatizing
the output increases the flexibility of the OCP's output appearance.
These files are written in normal HTML, except that they may have
<ocp> tags in them. The syntax of this tag is
<ocp CLASSNAME=numcols>. For example, to insert a table with "NETLOAD"
data into an OCP Web page, one would use <ocp NETLOAD=2>
(for two-column output). Data is filled in with an alternating light
background in each cell, to increase visual differentiation without
being overly distracting. Extra cells (if the number of elements
in a class is not a multiple of the number of columns in the table)
are left blank.
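The template expansion described above can be sketched in Python as
follows. This is an illustration under stated assumptions, not
OCP's implementation: the function names, the exact HTML emitted,
the two background colors, and the way shading alternates from cell
to cell are all hypothetical. Cell contents are inserted as-is,
since quoted configuration entries may contain HTML.

```python
import re
from typing import Dict, List, Tuple

# Matches tags such as <ocp NETLOAD=2>.
OCP_TAG = re.compile(r"<ocp\s+(\w+)=(\d+)\s*>", re.IGNORECASE)
SHADES = ("#ffffff", "#eeeeee")  # assumed pair of light backgrounds


def render_table(title: str, cells: List[str], numcols: int) -> str:
    if numcols < 1:
        raise ValueError("numcols must be at least 1")
    # Pad with blank cells so the last row is always complete.
    padded = list(cells) + [""] * (-len(cells) % numcols)
    rows = []
    for start in range(0, len(padded), numcols):
        tds = "".join(
            f'<td bgcolor="{SHADES[(start + i) % 2]}">{cell}</td>'
            for i, cell in enumerate(padded[start:start + numcols])
        )
        rows.append(f"<tr>{tds}</tr>")
    return f"<table><caption>{title}</caption>{''.join(rows)}</table>"


def expand_template(template: str,
                    classes: Dict[str, Tuple[str, List[str]]]) -> str:
    """Replace each <ocp CLASSNAME=numcols> tag with a table."""
    def replace(match: "re.Match[str]") -> str:
        name, numcols = match.group(1), int(match.group(2))
        if name not in classes:
            return match.group(0)  # leave unknown classes untouched
        title, cells = classes[name]
        return render_table(title, cells, numcols)
    return OCP_TAG.sub(replace, template)


page = expand_template(
    "<h1>Status</h1>\n<ocp NETLOAD=2>",
    {"NETLOAD": ("Network Load",
                 ["marianne eth0", "holly hme0", "amanda eth0"])},
)
print(page)
```

With three entries and two columns, the sketch emits two rows and
leaves the fourth cell blank, matching the behavior described for
classes whose size is not a multiple of the column count.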
Finishing Up
All that remains is to add the OCP PHP script or Perl script to
the Web server. This is as simple as adding links to it from an
existing page, or internally publishing the link or links, depending
upon how your site normally handles monitoring. If you need to protect
the data, normal access control mechanisms should be applied, which
might include multiple templates and CGI to ensure that various
users can only see the templates intended for them.
Useful Links
MRTG -- http://ee-staff.ethz.ch/~oetiker/webtools/mrtg/
GD -- http://www.boutell.com/gd/
14all -- http://my14all.sourceforge.net/
MRTG Viewer -- http://noc.asti.dost.gov.ph/docus/tools/mrtg/mrtgviewer.php
OCP -- http://www.fort-tech.com/software/ocp/
Ben Stern has been a programmer and systems administrator for
the past seven years, with a focus on highly available DNS architecture.
He has deployed these tools on nationwide networks, and used them
operationally to provide early detection of unusual usage patterns
and network instability. For the past two years, he has been writing
commercial-grade free software for use in high-performance environments.