Monitoring a SAN with MRTG
Mike Scott
Storage area networks (SANs) are relatively new to the sys admin's
toolbox and they bring a plethora of benefits. Unfortunately, they
also bring complexity. SAN technology can potentially connect a
server to hundreds or even thousands of storage devices via a single
fibre pair. Similarly, a single host with multiple host bus adaptors
(HBAs) can generate a huge amount of cross-SAN traffic, potentially
causing contention on shared devices.
The flow of traffic must be managed as the SAN grows, but before
the traffic can be effectively managed, you must be able to monitor
activity. Traffic is often balanced across multiple HBAs for performance
and redundancy, and when considering the requirements for monitoring
a SAN, it is important to consider that an "edge node" (a device
on the outer periphery of the SAN) does not relate to a host or
a storage array, but to an HBA. Thus, monitoring the SAN can be
a headache -- with multiple HBAs and multiple paths through the
SAN from host to storage array.
I recently saw a potential performance problem at a client site
where multiple Solaris hosts accessed a single EMC Symmetrix (via
a series of Brocade switches). It was suspected that the activities
of one or more of the hosts were adversely affecting the others,
including an important production system. A more permanent monitoring
solution was planned, but unlikely to be implemented within a month.
Thus, this project was started, and it provided visibility of the
SAN within a few hours of its inception.
I chose Perl and PHP to implement this solution -- Perl because
of its ability to handle complex data structures, and PHP for its
easy integration into the Apache Web server. With hindsight, it
might have been better to stick to one language, but this project
was developed in a hurry.
Introducing MRTG
MRTG (Multi Router Traffic Grapher) is a data gathering and
charting tool, written by Tobias Oetiker. As its name indicates,
it was originally developed to monitor traditional LAN and WAN devices.
When I searched the MRTG Web site, it became clear that MRTG
could do almost everything that we required. Given the correct SNMP
configuration for the switches, it can query the Fibre Channel statistics
and graph each port on the switch. MRTG also allows mathematical
operations on the SNMP-gathered data. This is a key enabler for
this project, as it allows for multiple ports (potentially on different
switches) to be aggregated in order to compute the overall throughput
for any given host into the SAN.
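As a rough illustration of the aggregation idea, the following Python
sketch (not the code used on this project) builds an MRTG target
expression that sums several switch ports. The community string, the
switch names, and the assumption that each Fibre Channel port can be
addressed by an MRTG interface number are hypothetical.

```python
def aggregate_target(ports, community="public"):
    """Build an MRTG target expression that sums several switch ports.

    `ports` is a list of (switch, interface_number) pairs. The community
    string and the interface numbering are assumptions for illustration.
    """
    if not ports:
        raise ValueError("at least one port is required")
    terms = [f"{ifnum}:{community}@{switch}" for switch, ifnum in ports]
    return " + ".join(terms)


# hostA1 has one HBA on each of two switches (port 1 on both).
print("Target[hosta1]: " + aggregate_target([("switchA1", 1), ("switchA2", 1)]))
```

Joining the terms with "+" relies on MRTG's support for arithmetic in
target definitions, so the per-host graph is computed by MRTG itself.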
The only problem was that, as the number of switches (and hence
ports) monitored by the toolkit increased, the MRTG configuration
file would rapidly become unmanageable. Furthermore, although MRTG
does a great job of generating all the necessary graphs and HTML
to our specification, we needed several layers of abstraction to
rapidly "drill down" into the data when troubleshooting.
This data should all be easily accessible by a browser, and so
it became clear that the visualization layer would have to sit between
the Apache server and the raw data to help the user interpret the
results. To solve these challenges, two programs were written --
the configurator, and the visualizer.
The monitoring host was to be an existing Sun E450, running Solaris
9, with the Apache Web server and Perl (as bundled by Sun). The
latest versions of PHP and MRTG were then installed, along with the
required libraries, most of which are usefully available pre-packaged
at http://www.sunfreeware.com.
Oetiker has included some excellent documentation on the MRTG
Web site, so I'll not discuss the details of installing MRTG here.
Once you've obtained the required libraries, it is an easy install.
To assist with the example provided in this article, Figure 1
shows a simplified diagram of a fictitious SAN setup. The port
assignments are as follows:
switchA1 port assignments:
Port 7 symmA, FA#0
Port 1 hostA1, HBA#0
Port 4 hostA2, HBA#0
switchA2 port assignments:
Port 7 symmA, FA#1
Port 1 hostA1, HBA#1
Port 12 hostA2, HBA#1
Port 15 switchB1, port 15
switchB1 port assignments:
Port 3 symmB, FA#0
Port 1 hostB1, HBA#0
Port 9 hostB2, HBA#0
Port 15 switchA1, port 15
switchB2 port assignments:
Port 3 symmB, FA#1
Port 1 hostB1, HBA#1
Port 9 hostB2, HBA#1
Configuration
The first part of the exercise is to translate the actual switch
configuration into the data structure at the head of the configurator
program. This has already been done in Listing 1 for the example
setup.
The raison d'être for this program is to allow a SAN configuration
to be specified in a compact human-readable format. This ensures
that it can be updated easily. The configurator will then take the
SAN configuration and generate suitable MRTG configuration files.
The first part of the program is clearly the data structure that
specifies the SAN. This data structure could be described as "switch-centric".
It is very easy to accurately populate the initial values, as the
structure can be compared directly with the output of the Brocade
"switchShow" command, which will display which ports of the Fibre
Channel switch are in use (of course, the Brocade command will show
the World-Wide Numbers of the connected devices, which may require
translation).
The data structure could easily be hived off to a separate file,
or even integrated with a Web front-end to make updates even easier;
however, for the sake of simplicity, it has been integrated with
the Perl program. The remainder of the program consists of several
"foreach" loops that first invert the data structure so that it
can be interpreted as "host-centric", and then construct the
per-host aggregate formulae.
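The inversion step can be sketched in a few lines of Python. This is
only an illustration of the idea, not the Perl configurator from
Listing 1; the dictionary layout is a hypothetical stand-in for the
real data structure, populated with the site A assignments above.

```python
from collections import defaultdict

# Switch-centric view: switch -> port -> (device, adapter).
# The layout is a hypothetical stand-in for the structure in Listing 1.
SAN = {
    "switchA1": {7: ("symmA", "FA#0"), 1: ("hostA1", "HBA#0"), 4: ("hostA2", "HBA#0")},
    "switchA2": {7: ("symmA", "FA#1"), 1: ("hostA1", "HBA#1"), 12: ("hostA2", "HBA#1")},
}


def invert(san):
    """Return a device-centric view: device -> list of (adapter, switch, port)."""
    by_device = defaultdict(list)
    for switch in sorted(san):
        for port in sorted(san[switch]):
            device, adapter = san[switch][port]
            by_device[device].append((adapter, switch, port))
    return dict(by_device)


# Print every device with the switch ports its adapters are attached to.
for device, links in sorted(invert(SAN).items()):
    print(device, links)
```

Each entry in the inverted view lists every switch port belonging to
one host or array, which is the information needed to build that
device's aggregate formula.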
When executed, the program processes the configuration data structure
and correlates each HBA with its respective host. It then produces
the MRTG configuration files (one per group). Listing 2 shows an
example output from the configurator tool. For brevity, I have included
only the first three targets of the "sitea.cfg" file.
Note that if a "global.inc" file exists in the configuration directory
at the time that the configurator is executed, it will be included
into the group configuration files via an MRTG "Include" directive.
This allows us to have a file that can specify options common to
all groups. An example of this is given in Listing 3. Also note
the use of the AddHead directive in the "global.inc" file to include
a CSS stylesheet to ensure that all pages have the same look as
the rest of the Web site.
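The conditional include might look something like the following
Python sketch. It is an assumption about how such a check could be
written rather than an extract from the configurator; the function
name and the header comment line are hypothetical.

```python
import os
import tempfile


def group_header(config_dir, group):
    """Return the opening lines of a group's MRTG configuration file.

    An "Include" line is emitted only when global.inc exists in config_dir.
    """
    lines = [f"# MRTG configuration for group {group}"]
    include_path = os.path.join(config_dir, "global.inc")
    if os.path.isfile(include_path):
        lines.append(f"Include: {include_path}")
    return lines


# Demonstration in a throwaway directory (hypothetical contents).
with tempfile.TemporaryDirectory() as cfg:
    print(group_header(cfg, "sitea"))  # no global.inc yet
    with open(os.path.join(cfg, "global.inc"), "w") as fh:
        fh.write("# options shared by all groups\n")
    print(group_header(cfg, "sitea"))  # now includes global.inc
```

Checking for the file at generation time means a group file never
refers to an include that does not exist.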
Once the configuration files have been generated, MRTG can be
started according to the instructions on the MRTG Web site. For
this, I chose to run it from cron every five minutes (preferably
by a non-privileged user). Listing 4 shows an example crontab extract.
Visualization
By this point, you should have MRTG up and running. It should
be generating HTML pages with PNG graphs, which is very useful but
very difficult to navigate. The visualizer is a short piece of PHP
code that will scan the group directories and present thumbnails
in a logical and structured manner. Only the aggregate graphs are
presented -- meaning that the clutter of the individual
HBA graphs will be masked from the user.
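The scanning logic can be sketched as follows. The visualizer itself
is written in PHP; this Python version is only an illustration. It
assumes, hypothetically, that per-HBA targets carry "_hba" in their
names, and it relies on each target having a <name>.html page and a
<name>-day.png graph.

```python
import os
import tempfile
from html import escape


def aggregate_targets(group_dir):
    """List targets in group_dir that look like per-host aggregates.

    Assumes (hypothetically) that per-HBA targets contain "_hba" in their
    name and that each target has <name>.html and <name>-day.png.
    """
    targets = []
    for entry in sorted(os.listdir(group_dir)):
        name, ext = os.path.splitext(entry)
        if ext != ".html" or "_hba" in name or name == "index":
            continue
        if os.path.isfile(os.path.join(group_dir, name + "-day.png")):
            targets.append(name)
    return targets


def thumbnail_html(group, targets):
    """Render one linked thumbnail per aggregate target."""
    rows = []
    for name in targets:
        page = escape(f"{group}/{name}.html", quote=True)
        image = escape(f"{group}/{name}-day.png", quote=True)
        alt = escape(name, quote=True)
        rows.append(f'<a href="{page}"><img src="{image}" width="250" alt="{alt}"></a>')
    return "\n".join(rows)


# Demonstration with empty placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as group_dir:
    for fname in ("hosta1.html", "hosta1-day.png", "hosta1_hba0.html"):
        open(os.path.join(group_dir, fname), "w").close()
    print(thumbnail_html("sitea", aggregate_targets(group_dir)))
```

Filtering on the file name keeps the group page short, while the
per-HBA pages remain reachable from each host's own MRTG page.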
When the visualizer presents a page of thumbnails, it links each
thumbnail back to the MRTG-generated HTML page for the host's aggregate
graphs. The user can then expand the selection simply by clicking
on the thumbnail graph to obtain a better view of what is going
on with that machine. Figure 2 shows an example of the group view,
showing all hosts and storage devices. Note that switchb1 and switcha1
are listed -- these graphs represent the inter-switch links.
The visualizer code should be installed in your $HTMLDIR, as "index.php",
and the Apache Web server configured with the directive "DirectoryIndex
index.php" in order that it is picked up automatically when the
browser requests a directory index.
When the user follows the graph thumbnail from the group page,
the browser is redirected to the MRTG-produced HTML page for that
host's aggregate statistics (see Figure 3). This shows detailed
aggregate graphs, followed by thumbnails of each component HBA.
Each HBA thumbnail can similarly be expanded by clicking on the
graph thumbnail to give a detailed report of the activities of that
individual adapter. See Figure 4 for an example.
Conclusion
Enterprise monitoring packages such as BMC Patrol and TeamQuest
are very good at what they do, but occasionally we need to look
at very specific sets of data for a short period of time while performance
troubleshooting. Often, the easiest way to achieve this is to have
a package such as MRTG available that can quickly be deployed for
an ad hoc request.
In this example, MRTG allowed us to very quickly visualize the
entire SAN. We identified key performance facts of which we were
not previously aware:
- Relative load on each individual Symmetrix from each server
- Correlation of SAN activity against specific events (e.g.,
significant data loading operations that were causing concern
for contention on the Symmetrix arrays)
- Identification of times of significant activity, allowing us
to feed back to the application owners with time-based data in
order to reschedule I/O intensive applications to a quieter time
Often we want to monitor critical servers. In this case, it may
be difficult to justify significant changes that could impact production.
Monitoring the SAN from the switches via SNMP is perceived as a
very non-intrusive method (and certainly easier than deploying software
agents to all SAN-connected hosts). This solution would likely be
approved by even the strictest change-management policies.
MRTG is an extremely flexible tool that enabled us to rapidly
begin monitoring the SAN environment to analyze a specific problem.
It is still being used today at the client site and is considered
a valuable tool in performance monitoring. Tobias Oetiker and Dave
Rand have done a tremendous job in developing this package, and
it certainly deserves sys admins' consideration.
The code presented in this article is an example of a rapidly
developed application and could certainly be improved upon. It is,
however, a good example of what can be achieved by extending an
existing generic application with tools such as PHP and Perl to
suit a very specific set of requirements. An archive of these scripts,
including more detail on the install instructions than was possible
to include in this article, is available at: http://hindsight.it/san/.
References
Beauchamp, Chris and Josh Judd. Building SANs with Brocade,
Syngress, 2001.
Clark, Tom. Designing Storage Area Networks, Addison-Wesley,
2000.
Oetiker, Tobias and Dave Rand. Multi Router Traffic Grapher, available
at: http://people.ee.ethz.ch/~oetiker/webtools/mrtg/
Mike Scott is the director of Hindsight IT Ltd, a small Solaris
consultancy based in Central Scotland. He has been working in the
North East and the central belt for the past ten years, specializing
in systems management with a keen interest in security and performance
management. He can be contacted at: sysadmin@hindsight.it.