tSmoke:
Automating Availability Measures with Smokeping
Dan McGinn-Combs
I manage a global network. Although I am on call all the time,
it is not unusual for my colleagues in different time zones to refrain
from calling me during my night. Even so, I like to know whether
something is down on the network when I wake up in the morning.
That way, I have some idea about how much of my day will be consumed
with fighting fires and how much I can dedicate to my day job.
Recently, I was asked to provide a more formal network availability
report -- something that could be used in my performance review.
Because I already have Smokeping (see "Monitoring Latency with
Smokeping": http://www.samag.com/documents/s=8284/sam0307d/)
running and monitoring latency to critical network nodes and some
additional servers, it seemed like a good idea to use this tool
as a basis for automating both tasks. I just needed to figure out
how to tie these items together.
Overview of Smokeping
Smokeping (http://www.smokeping.org) was developed by Tobias
Oetiker to measure and graphically present network latency and packet
loss. It is based on his round-robin database. A round-robin database
stores numeric data as configurable averages in circular timeslots
so the database file will not expand over time.
Smokeping manages these round-robin databases using the Perl bindings
of RRDTool (http://www.rrdtool.org). Smokeping can monitor
the network latency of a device using one of several included plug-ins.
For example, the default "fping" plug-in sends out a rapid-fire
group of 20 (by default) ICMP pings to a device. Fping returns the
latency of the replies as well as the count of pings returned. This
information is then tucked away in a round-robin database.
Tracking Availability
While this latency measure is extremely useful for providing information
to users and management about network performance, it does not report
availability directly. However, because Smokeping also stores the
packet-loss count from each round of pings in the round-robin database,
the database already contains a reasonably complete account of the
success and failure of Smokeping to reach a given target.
Extracting Availability Data from the Round-Robin Database
The Perl binding RRDs::graph retrieves the information from the
database. The PRINT directive of RRDTool provides an average of
the number of pings returned across a given timeframe. All that
is required is to convert this number into a percentage. To do this,
I used the CDEF feature of RRDTool. CDEF makes data calculations
using Reverse Polish Notation.
The script constructs a variable, "avail", from the loss data
source, which holds the number of pings that went unanswered in each
round. If loss is lower than the number of pings sent ($pings), at
least one reply came back, so the target device counts as available
and avail is set to 100. If loss is greater than or equal to $pings,
every ping was lost, so the device counts as down and avail is set to
0. (The cutoff is arbitrary; a stricter threshold could be used in
place of $pings.) By averaging these 100s and 0s together, we get the
percent availability over the period defined by start and end:
$pings = 20; # Default number of pings per round
RRDs::graph "fake.png",
    '--start', '-86400',
    '--end', '-300',
    "DEF:loss=${rrd}:loss:AVERAGE",
    "CDEF:avail=loss,$pings,GE,0,100,IF",
    "PRINT:avail:AVERAGE:%.2lf";
So, what happens if the target host is brand new? For the time before
the device was being tracked, RRDTool will use "Not-a-Number",
which the formula above will convert into 100. This will be a false
reading. Thus, it is important to make anything that is not a number
count as down, that is, as 0. For example, if you install a device on
Wednesday, it will show downtime on Monday and Tuesday when you
calculate its availability for the previous week. This is correct:
because the device was not physically there, it was also not
available. The improved CDEF first replaces an unknown loss with
$pings, so that an unknown sample is treated as a round with every
ping lost and scores 0:
"CDEF:avail=loss,UN,$pings,loss,IF,$pings,GE,0,100,IF"
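The effect of this rule can be checked without an RRD file. The
following is only a sketch in plain Perl, not part of tSmoke: the
function names and the sample data are made up for illustration, and
undef stands in for RRDTool's "Not-a-Number". It scores each sample
0 or 100 with the same comparison against $pings that the CDEF uses,
counts unknown samples as down, and then averages the scores the way
PRINT:avail:AVERAGE would.

```perl
use strict;
use warnings;

# Hypothetical helper: score one sample. $loss is the stored loss
# value; undef stands in for an unknown (Not-a-Number) sample.
sub sample_avail {
    my ($loss, $pings) = @_;
    return 0 unless defined $loss;       # unknown counts as down
    return $loss >= $pings ? 0 : 100;    # same test as loss,$pings,GE,0,100,IF
}

# Average the per-sample scores over the whole period.
sub percent_avail {
    my ($pings, @samples) = @_;
    return 0 unless @samples;
    my $total = 0;
    $total += sample_avail($_, $pings) for @samples;
    return $total / @samples;
}

# Made-up data: two unknown samples, one round with all 20 pings
# lost, and five rounds in which at least one reply came back.
my @samples = (undef, undef, 20, 0, 0, 3, 0, 0);
printf "%.2f\n", percent_avail(20, @samples);   # prints 62.50
```

Five of the eight samples score 100, so the average is 62.5 percent.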
Script Components and Reusing Smokeping Functions
With this calculation basis available, all that is left is to
create a script that can extract the data and present it as required.
This could be a daunting task. But, fortunately, Tobias Oetiker
wrote Smokeping as a Perl module. Smokeping.pm is full of useful
functions that make this task much easier. By calling and reusing
these functions, the number of script functions is considerably
reduced.
Building Block Approach to the Code
I decided to build tSmoke from the bottom up, as a hierarchy of
functions, so that it would be easier to complete. Starting with the
lowest level of functions, I could complete and test each section
before moving on to the next.
List Owner
Smokeping comes with an example configuration file called "config".
Within this file is a section called "General". This section
contains several key bits of information, such as the name and email
address of the owner of the Smokeping installation. If I could read
these bits of data, then I knew the parser was working! By calling
the Smokeping.pm parser directly, I saved myself about 70 lines
of complex code:
sub load_cfg ($) {
    my $cfgfile = shift;
    my $parser = Smokeping::get_parser;
    $cfg = Smokeping::get_config $parser, $cfgfile;
}
The function load_cfg extracts information from the config file and
places it in a nested hash, referenced by "$cfg", for later retrieval.
As tSmoke starts, it calls this function and
prints the site information unless the quiet option "--quiet"
is added to the command line:
print "tSmoke for network managed by $cfg->{General}{owner}\n",
      "at $cfg->{General}{contact}\n",
      "(c) 2003 Dan McGinn-Combs\n" unless $opt{quiet};
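How the %opt hash gets filled is not shown here. One plausible way,
assuming the script uses Perl's standard Getopt::Long module, is
sketched below. The option list is inferred from the options
mentioned in this article, and parse_args is a made-up wrapper that
exists only to make the example easy to try.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical wrapper: parse a list of arguments into an option hash.
sub parse_args {
    my @args = @_;
    my %opt = (detail => 0);    # no detail unless requested
    GetOptionsFromArray(\@args, \%opt,
        'quiet', 'morning', 'weekly', 'to=s', 'detail=i')
        or die "ERROR: bad command line\n";
    return \%opt;
}

# Getopt::Long accepts unique abbreviations, so --q selects --quiet.
my $opt = parse_args(qw(--q --morning --to=mycellphone@mobile.net));
print "quiet mode\n" if $opt->{quiet};
print "morning report goes to $opt->{to}\n" if $opt->{morning};
```

With options parsed this way, a test such as "unless $opt{quiet}"
needs no further argument handling.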
List the Round-Robin Databases
Once the script can parse the config file, the next step is to determine
which round-robin databases contain the availability data. This information
is stored in the Targets hash. The script recursively inspects each
hash key. If the value under a key is itself a hash reference (ref
returns "HASH"), the script goes down another level:
if ($rrds eq 'host') {
$prline .= "$cfg->{General}{datadir}$path".".rrd\n";
}
If a level of the configuration contains a host directive, then the
data directory from the "General" section is concatenated with the
path to that level, and the result is added to a simple text list of
the round-robin database file names.
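The recursion can be illustrated with a self-contained sketch. This
is not the tSmoke function itself: the name list_rrds_sketch, the
argument order, the sample tree, and the /var/smokeping data
directory are all assumptions made for the example.

```perl
use strict;
use warnings;

# Sketch only: walk a nested hash and return one .rrd path per line
# for every level that carries a 'host' key.
sub list_rrds_sketch {
    my ($tree, $datadir, $path) = @_;
    my $out = '';
    foreach my $key (sort keys %$tree) {
        if (ref($tree->{$key}) eq 'HASH') {
            # A nested section: go down another level.
            $out .= list_rrds_sketch($tree->{$key}, $datadir, "$path/$key");
        }
        elsif ($key eq 'host') {
            # A real target: its database is named after the path.
            $out .= "$datadir$path.rrd\n";
        }
    }
    return $out;
}

# Made-up target tree for illustration.
my %targets = (
    Router => {
        NA => { Atlanta => { host => 'atl-rtr.example.com' } },
        EU => { London  => { host => 'lon-rtr.example.com' } },
    },
);
print list_rrds_sketch(\%targets, '/var/smokeping', '');
```

Because the keys are sorted, the London database is listed before
the Atlanta one.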
Morning Status
With the script returning the list of active round-robin databases,
it is a simple matter to split them into an array variable:
my @rrds = split ( /\n/,list_rrds($cfg->{Targets},"","") );
And with the array, the script can iterate through the list using
the Perl binding RRDs::fetch to check whether any of the databases
show, in their most recent reading, a round in which every ping was
lost. Such a reading would indicate that a host had returned no pings.
The script concatenates the down target names into a string.
Then it sends the information to me using the sendmail function.
I get this information every morning at 6:15 a.m. so as the day starts,
I know which systems are down.
Weekly Availability Measurement
Obtaining the weekly availability measure is a little trickier.
Remember, this is going to be part of my performance review. I was
tempted to just list 100% in every category, but it seemed more
challenging to get accurate data.
My first attempt was to use some Perl components to create
an Excel spreadsheet, letting Excel make the appropriate data calculations.
However, that was more complicated than necessary. Since the script
is Perl, all the calculations could be done in Perl, which would
simplify the weekly measures.
External HTML File to Construct a Mail Message
Again, following the lead of Tobias Oetiker in Smokeping.pm,
I created an HTML template into which the availability data can
be placed. Doing this means that a small change to the config file
must be made.
I added a line to the General section defining the message
template for the weekly availability HTML page:
*** General ***
tmail = /usr/local/smokeping/etc/tmail
In order to read this variable, an addition must be made to the General
section of get_parser in Smokeping.pm:
[ qw(owner imgcache imgurl datadir piddir sendmail smokemail cgiurl
mailhost contact syslogfacility syslogpriority tmail) ],
The weekly function uses the list_rrds function to gather the round-robin
databases and the graph function discussed earlier to return an average
availability of each target over four time periods: the previous day,
the previous week, the previous month, and the previous quarter.
Too Much Detail
Then the script averages these averages together to provide
a roll-up view of device types. For example, in my installation,
I track the availability of routers in three geographic areas: North
America, Europe, and Asia-Pacific. As a result, the router number
reported is the average across all the routers in all these areas.
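A roll-up of this kind can be sketched in a few lines of Perl. The
function name rollup, the grouping by the first component of the
dotted target name, and the availability figures are assumptions
made for illustration, not code or data taken from tSmoke.

```perl
use strict;
use warnings;

# Hypothetical roll-up: group per-device availability by the first
# component of the dotted name and average each group.
sub rollup {
    my ($avail) = @_;
    my (%sum, %count, %avg);
    for my $name (keys %$avail) {
        my ($type) = split /\./, $name;
        $sum{$type} += $avail->{$name};
        $count{$type}++;
    }
    $avg{$_} = $sum{$_} / $count{$_} for keys %sum;
    return \%avg;
}

# Made-up availability percentages for illustration only.
my %avail = (
    'Router.NA.Atlanta' => 100,
    'Router.EU.London'  => 99.5,
    'Router.AP.Sydney'  => 98.5,
    'Server.NA.Mail'    => 97,
);
my $by_type = rollup(\%avail);
printf "%-7s %6.2f\n", $_, $by_type->{$_} for sort keys %$by_type;
```

Here the three routers average to 99.33 percent, while the single
server keeps its own figure of 97.00 percent.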
The script can also report the average availability of each
device individually. This report can get very complex, so I added
another command-line option to tell the program how much detail
to generate. If the command-line option "detail" is a
zero or non-existent, the script will not generate any detail information.
If "detail" is 1, the script provides the first level
of detail, and so on.
The hierarchical name of each device is delimited by dots,
for example, "Router.NA.Atlanta". The script knows how
much detail to include in the report by counting the number of dots
in the target name:
next if NumDots ($_) > $opt{detail};
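The NumDots helper itself is not listed here. A minimal sketch of how
such a dot-counting filter might work follows; the function name
num_dots and the sample target names are chosen purely for
illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for NumDots: count the dots in a target name.
sub num_dots {
    my ($name) = @_;
    return ($name =~ tr/.//);   # tr with an empty replacement only counts
}

my @targets = ('Router', 'Router.NA', 'Router.NA.Atlanta');

# Show which names survive the filter at each detail level.
for my $detail (0 .. 2) {
    my @shown = grep { num_dots($_) <= $detail } @targets;
    print "detail=$detail: @shown\n";
}
```

Each additional level of detail admits names with one more dot, so
the report grows one layer of the hierarchy at a time.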
Once the summary and detail information are gathered from the hashes,
the script opens the tmail html file and replaces the components with
the data:
open tSMOKE, $cfg->{General}{tmail}
    or die "ERROR: can't read $cfg->{General}{tmail}\n";
# Build each report section once, not once per template line
my $Summary   = Summary_Sheet();
my $Daily     = DetailSheet(86400);    # sec per day
my $Weekly    = DetailSheet(604800);   # sec per week
my $Monthly   = DetailSheet(2592000);  # sec per month (30 days)
my $Quarterly = DetailSheet(7776000);  # sec per quarter (90 days)
while (<tSMOKE>) {
    s/<##SUMMARY##>/$Summary/ig;
    s/<##DAYDETAIL##>/$Daily/ig;
    s/<##WEEKDETAIL##>/$Weekly/ig;
    s/<##MONTHDETAIL##>/$Monthly/ig;
    s/<##QUARTERDETAIL##>/$Quarterly/ig;
    $Body .= $_;
}
close tSMOKE;
Finally, the script sends the weekly availability report using the
sendmail function (see Figure 1).
Availability Measurement Accuracy
The availability reports are only as accurate as the data used
to create them. In the case of Smokeping, the data is always being
summarized. Each data source is consolidated over time. This consolidation
is the key to keeping the round-robin dataset from expanding over
time. The trade-off is that the accuracy of the information becomes
less and less sharp over the longer periods. Smokeping uses the
following default sets of average consolidation factors:
1 -- Each five-minute reading is placed into one of 1008
slots. In this consolidation, there is room for 84 hours (3.5 days)
worth of data with a downtime resolution of 5 minutes.
12 -- Twelve five-minute readings are averaged and placed into one
of 4320 slots. In this consolidation, there is room for up to 180
days worth of data with a downtime resolution of 1 hour.
144 -- 144 five-minute readings are averaged and placed into one
of 720 slots. In this consolidation, there is room for 360 days
worth of data with a downtime resolution of 12 hours.
RRDTool automatically picks the most appropriate consolidation
factor for the time span it is asked to graph and print. For the
previous day, it will choose the first consolidation factor. But
for the previous week and up to 6 months, it will choose the second.
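The arithmetic behind these figures is the consolidation factor times
the five-minute step times the number of slots. The sketch below
works it through for the three default archives and applies a
deliberately simplified selection rule, namely the finest archive
that still covers the requested span. That rule is an assumption for
illustration; RRDTool's actual selection logic has more to it.

```perl
use strict;
use warnings;

my $step = 300;   # seconds between Smokeping readings (five minutes)

# [consolidation factor, number of slots] for the default archives
my @archives = ([1, 1008], [12, 4320], [144, 720]);

# Seconds covered by one slot, and by the whole archive.
sub resolution { my ($rra) = @_; return $rra->[0] * $step }
sub retention  { my ($rra) = @_; return resolution($rra) * $rra->[1] }

# Simplified rule (an assumption, not RRDTool's exact logic): take
# the finest archive that still covers the requested time span.
sub pick_archive {
    my ($span) = @_;
    for my $rra (@archives) {
        return $rra if retention($rra) >= $span;
    }
    return $archives[-1];
}

for my $rra (@archives) {
    printf "factor %3d: one slot = %3d min, archive = %5.1f days\n",
        $rra->[0], resolution($rra) / 60, retention($rra) / 86400;
}
# Day, week, month, and quarter, in seconds
for my $span (86400, 604800, 2592000, 7776000) {
    printf "%8d s -> factor %d\n", $span, pick_archive($span)->[0];
}
```

Under this rule, only the one-day report is served from the
five-minute archive; the week, month, and quarter all fall to the
one-hour archive.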
As can be seen, the longer the time span, the coarser
the numbers will be. In other words, the minimum time reported as
down will be 5 minutes in the first consolidation, but it will be
an hour in the second, and 12 hours in the third. At a quarterly
and even a semi-annual interval, however, the numbers will work
reasonably well. In fact, over longer time spans, the averages are
accurate enough, because the reporting period grows along with the
consolidation interval, so each consolidated slot remains a small
fraction of the total.
Installing tSmoke
tSmoke is easy to install. Place the Perl script into the Smokeping
bin directory. Place the tmail file, which is used by the weekly
availability summary, in the Smokeping etc directory. Mine are in
the following locations:
/usr/local/smokeping/bin/tSmoke.pl
/usr/local/smokeping/etc/tmail
Edit the lines in tSmoke.pl indicating the location of the Smokeping
and RRDtool library files:
# Point the lib variables to your implementation
use lib "/usr/local/smokeping/lib";
use lib "/usr/local/rrdtool-1.0.39/lib/perl";
Identify the location of your local configuration file:
# Point to your Smokeping config file
my $cfgfile = "/usr/local/smokeping/etc/config";
Finally, modify get_parser in Smokeping.pm as described earlier so
that it accepts the tmail variable, and add the tmail line to the
General section of the config file. Now, tSmoke is ready to use. To
test it, enter the command:
/usr/local/smokeping/bin/tSmoke.pl
Look for an introductory message like this:
tSmoke for network managed by Dan McGinn-Combs
at dan.mcginn-combs@geac.com
(c) 2003 Dan McGinn-Combs
At this point, you are ready to automate the daily and weekly reporting
by adding a couple of entries to your crontab. My entries, shown below
in the /etc/crontab format with its user field, deliver the daily
report to my mobile phone at 6:15 a.m. and the weekly report every
Sunday at 8 a.m., so I can forward it to my manager. Each entry must
stay on a single line, because cron does not support line
continuation:
# Smokeping crontabs
15 6 * * * root /usr/local/smokeping/bin/tSmoke.pl --q --morning --to=mycellphone@mobile.net
0 8 * * 0 root /usr/local/smokeping/bin/tSmoke.pl --q --to=myaddress@company.com --weekly
Summary
tSmoke is a handy tool for using the existing data in Smokeping
to get more information about your network. tSmoke provides a point-in-time
snapshot of anything that's down using the morning report.
It also provides an objective measure based on the data gathered
by Smokeping to show the overall status of system availability on
the network. tSmoke shows that, with a little effort, you can do
much more with Smokeping than just measure network latency.
Dan has been an IT manager with Geac Computer Corporation,
Ltd based in Atlanta, Georgia for more than 20 years. His primary
functions include management of the global network and enterprise-wide
security operations. He enjoys applying real life problem-solving
techniques with open source tools to create a solution. He has received
a bachelor's degree from Southern Illinois University and holds
GSEC security certification. He can be reached at dan.mcginn-combs@geac.com.