oct2004.tar

tSmoke: Automating Availability Measures with Smokeping

Dan McGinn-Combs

I manage a global network. Although I am on call all the time, it is not unusual for my colleagues in different time zones to refrain from calling me during my night. Even so, I like to know whether something is down on the network when I wake up in the morning. That way, I have some idea about how much of my day will be consumed with fighting fires and how much I can dedicate to my day job.

Recently, I was asked to provide a more formal network availability report -- something that could be used in my performance review. Because I already have Smokeping (see "Monitoring Latency with Smokeping": http://www.samag.com/documents/s=8284/sam0307d/) running and monitoring latency to critical network nodes and some additional servers, it seemed like a good idea to use this tool as a basis for automating both tasks. I just needed to figure out how to tie these items together.

Overview of Smokeping

Smokeping (http://www.smokeping.org) was developed by Tobias Oetiker to measure and graphically present network latency and packet loss. It is based on his round-robin database. A round-robin database stores numeric data as configurable averages in circular timeslots so the database file will not expand over time.

Smokeping manages these round-robin databases using the Perl bindings of RRDTool (http://www.rrdtool.org). Smokeping can monitor the network latency of a device using one of several included plug-ins. For example, the default "fping" plug-in sends out a rapid-fire group of 20 (by default) ICMP pings to a device. Fping returns the latency of the replies as well as the count of pings returned. This information is then tucked away in a round-robin database.

Tracking Availability

While this latency measure is extremely useful for providing information to users and management about network throughput, it provides no availability information. Because Smokeping adds the number of pings returned during testing of latency into the round-robin database, the database already contains a reasonably complete account of the success and failure of Smokeping to reach a given target.

Extracting Availability Data from the Round-Robin Database

The Perl binding RRDs::graph retrieves the information from the database. The PRINT directive of RRDTool provides an average of the number of pings returned across a given timeframe. All that is required is to convert this number into a percentage. To do this, I used the CDEF feature of RRDTool. CDEF makes data calculations using Reverse Polish Notation.

The script constructs a variable, "avail", based on the number of pings returned, which is stored in the variable loss. If the number of pings returned is greater than or equal to 10% of the pings (an arbitrary number), then the target device is available and loss is changed to 100. If the number of pings returned is less than 10%, then it is down and loss is changed to 0. By averaging these 100s and 0s together, we get the percent availability over the period defined by start and end:

$pings = 20;        # Default number of PINGs
RRDs::graph "fake.png",
         `--start','-86400',
         `-end','-300',
         "DEF:loss=${rrd}:loss:AVERAGE",
         "CDEF:avail=loss,$pings,GE,0,100,IF",
         "PRINT:avail:AVERAGE:%.2lf";

So, what happens if the target host is brand new? For the time before the device was being tracked, RRDTool will use "Not-a-Number", which will be converted by this formula into 100. This will be a false reading. Thus, it is important to convert anything that is not a number into a 0. For example, if you install a device on Wednesday, it will show downtime on Monday and Tuesday when you calculate its availability for the previous week. This is correct because the device was not physically there; it was also not available! The improved CDEF is:

"CDEF:avail=loss,UN,0,loss,IF,$pings,GE,0,100,IF"

Script Components and Reusing Smokeping Functions

With this calculation basis available, all that is left is to create a script that can extract the data and present it as required. This could be a daunting task. But, fortunately, Tobias Oetiker wrote Smokeping as a Perl module. Smokeping.pm is full of useful functions that make this task much easier. By calling and reusing these functions, the number of script functions is considerably reduced.

Building Block Approach to the Code

I decided to write tSmoke in a hierarchical method, so that it would be easier to complete. Starting with the lowest level of functions, I could complete and test each section before moving on to the next.

List Owner

Smokeping comes with an example configuration file called "config". Within this file is a section called "General". This section contains several key bits of information, such as the name and email address of the owner of the Smokeping installation. If I could read these bits of data, then I knew the parser was working! By calling the Smokeping.pm parser directly, I saved myself about 70 lines of complex code:

sub load_cfg ($) { 
    my $cfgfile = shift;
    my $parser = Smokeping::get_parser;
        $cfg = Smokeping::get_config $parser, $cfgfile;
}

The function load_cfg extracts information from the config file and places it in several memory hashes for later retrieval using "$cfg" to point to the hash. As tSmoke starts, it calls this function and prints the site information unless the quiet option "--quiet" is added to the command line:

print "tSmoke for network managed by $cfg->{General}{owner}\n
         at $cfg->{General}{contact}\n
         (c) 2003 Dan McGinn-Combs\n" unless $opt{quiet};

List the Round-Robin Databases

Once the script can parse the config file, the next step is to determine which round-robin databases contain the availability data. This information is stored in the Targets hash. The script recursively inspects each hash key. If the reference is "HASH", the script goes down another level:

        if ($rrds eq 'host') {
            $prline .= "$cfg->{General}{datadir}$path".".rrd\n";
            }

If the configuration contains a host directive, then the name is concatenated with the path from the "General" section to return a simple text list of the round-robin database names.

Morning Status

With the script returning the list of active round-robin databases, it is a simple matter to split them into an array variable:

my @rrds = split ( /\n/,list_rrds($cfg->{Targets},"","") );

And with the array, the script can iterate through the list using the Perl binding RRDs:fetch to check whether any of the databases have a zero in their current loss field. A zero would indicate that a host had returned no pings.

The script concatenates the down target names into a string. Then it sends the information to me using the sendmail function. I get this information every morning at 6 a.m. so as the day starts, I know which systems are down.

Weekly Availability Measurement

Obtaining the weekly availability measure is a little trickier. Remember, this is going to be part of my performance review. I was tempted to just list 100% in every category, but it seemed more challenging to get accurate data.

My first attempt was to use some Perl components to create an Excel spreadsheet, letting Excel make the appropriate data calculations. However, that was more complicated than necessary. Since the script is Perl, all the calculations could be done in Perl, which would simplify the weekly measures.

External HTML File to Construct a Mail Message

Again, following the lead of Tobias Oetiker in Smokeping.pm, I created an HTML template into which the availability data can be placed. Doing this means that a small change to the config file must be made.

I added a line to the General section defining the message template for the weekly availability HTML page:

*** General ***
tmail = /usr/local/Smokeping/etc/tmail

In order to read this variable, an addition must be made to the General section of get_parser in Smokeping.pm:

[ qw(owner imgcache imgurl datadir piddir sendmail smokemail cgiurl
mailhost contact syslogfacility syslogpriority tmail) ],

The weekly function uses the list_rrds function to gather the round-robin databases and the graph function discussed earlier to return an average availability of each target over four time periods: the previous day, the previous week, the previous month, and the previous quarter.

Too Much Detail

Then the script averages these averages together to provide a roll-up view of device types. For example, in my installation, I track the availability of routers in three geographic areas, North America, Europe, and Asia-Pacific. As a result, the router number reported is the sum of all the routers in all these areas.

The script can also report the average availability of each device individually. This report can get very complex, so I added another command-line option to tell the program how much detail to generate. If the command-line option "detail" is a zero or non-existent, the script will not generate any detail information. If "detail" is 1, the script provides the first level of detail, and so on.

The hierarchical name of each device is delimited by a dot, for example, "Router.NA.Atlanta." The script knows how much detail to include in the report by counting the number of dots included in the target description:

next if NumDots ($_) > $opt{detail};

Once the summary and detail information are gathered from the hashes, the script opens the tmail html file and replaces the components with the data:

open tSMOKE, $cfg->{General}{tmail} or die "ERROR: can't read \
  $cfg->{General}{tmail}\n";
         while (<tSMOKE>){
            my $Summary = Summary_Sheet();
            s/<##SUMMARY##>/$Summary/ig;
            my $Daily = DetailSheet(86400);        #sec per day
            s/<##DAYDETAIL##>/$Daily/ig;
            my $Weekly = DetailSheet(604800);      #sec per week
            s/<##WEEKDETAIL##>/$Weekly/ig;
            my $Monthly = DetailSheet(2592000);    #sec per month
            s/<##MONTHDETAIL##>/$Monthly/ig;
            my $Quarterly = DetailSheet(7776000);  #sec per quarter
            s/<##QUARTERDETAIL##>/$Quarterly/ig;
         $Body .= $_;
         }

Finally, the script sends the weekly availability report using the sendmail function (see Figure 1).

Availability Measurement Accuracy

The availability reports are only as accurate as the data used to create them. In the case of Smokeping, the data is always being summarized. Each data source is consolidated over time. This consolidation is the key to keeping the round-robin dataset from expanding over time. The trade-off is that the accuracy of the information becomes less and less sharp over the longer periods. Smokeping uses the following default sets of average consolidation factors:

1 -- Five-minute readings each placed into one of 1008 slots. In this consolidation, there is room for 60 hours worth of data with a downtime resolution of 5 minutes.

12 -- Five-minute readings averaged and placed into one of 4320 slots. In this consolidation, there is room for up to 180 days worth of data with a downtime resolution of 1 hour.

144 -- Five-minute readings averaged and placed into one of 720 slots. In this consolidation, there is room for 360 days worth of data with a downtime resolution of 12 hours.

RRDTool automatically picks the most appropriate consolidation factor for the time span it is asked to graph and print. For the previous day, it will choose the first consolidation factor. But for the previous week and up to 6 months, it will choose the second.

As can be seen, the longer the time span, the less accurate the numbers will be. In other words, the minimum time reported as down will be 5 minutes in the first consolidation, but it will be an hour in the second, and 12 hours in the third. At a quarterly and even semi-annually interval, however, the numbers will work reasonably well. In fact, over longer time spans, the averages will be more than accurate, because the numbers grow larger as the consolidation grows larger as well.

Installing tSmoke

tSmoke is easy to install. Place the Perl script into the Smokeping bin directory. Place the tmail file, which is used by the weekly availability summary, in the Smokeping etc directory. Mine are in the following locations:

/usr/local/smokeping/bin/tSmoke.pl
/usr/local/smokeping/etc/tmail

Edit the lines in tSmoke.pl indicating the location of the Smokeping and RRDtool library files:

# Point the lib variables to your implementation
use lib "/usr/local/smokeping/lib";
use lib "/usr/local/rrdtool-1.0.39/lib/perl";

Identify the location of your local configuration file:

# Point to your Smokeping config file
my $cfgfile = "/usr/local/smokeping/etc/config";

Finally, modify get_parser in Smokeping.pm so it will find the tmail file. Now, tSmoke is ready to use. To test it, enter the command:

/usr/local/Smokeping/bin/tSmoke.pl

Look for an introductory message like this:

tSmoke for network managed by Dan McGinn-Combs
at dan.mcginn-combs@geac.com
(c) 2003 Dan McGinn-Combs

At this point, you are ready to automate the daily and weekly reporting by adding a couple of entries into your crontab. My entries look like this. I get the daily report on my mobile phone at 6:15 a.m. I get the weekly report every Sunday at 8 a.m., so I can forward it to my manager:

# Smokeping crontabs
15 6 * * *      root    /usr/local/smokeping/bin/tSmoke.pl --q \
  --morning --to=mycellphone@mobile.net
0 8 * 0 *       root    /usr/local/smokeping/bin/tSmoke.pl --q \
  --to=myaddress@company.com --weekly

Summary

tSmoke is a handy tool for using the existing data in Smokeping to get more information about your network. tSmoke provides a point-in-time snapshot of anything that's down using the morning report. It also provides an objective measure based on the data gathered by Smokeping to show the overall status of system availability on the network. tSmoke shows that, with a little effort, you can do much more with Smokeping than just measure network latency.

Dan has been an IT manager with Geac Computer Corporation, Ltd based in Atlanta, Georgia for more than 20 years. His primary functions include management of the global network and enterprise-wide security operations. He enjoys applying real life problem-solving techniques with open source tools to create a solution. He has received a bachelor's degree from Southern Illinois University and holds GSEC security certification. He can be reached at dan.mcginn-combs@geac.com.