Cover V13, i07
jul2004.tar

Monitoring a SAN with MRTG

Mike Scott

Storage area networks (SANs) are relatively new to the sys admin's toolbox and they bring a plethora of benefits. Unfortunately, they also bring complexity. SAN technology can potentially connect a server to hundreds or even thousands of storage devices via a single fibre pair. Similarly, a single host with multiple host bus adaptors (HBAs) can generate a huge amount of cross-SAN traffic, potentially causing contention on shared devices.

The flow of traffic must be managed as the SAN grows, but before the traffic can be effectively managed, you must be able to monitor activity. Traffic is often balanced across multiple HBAs for performance and redundancy, and when considering the requirements for monitoring a SAN, it is important to consider that an "edge node" (a device on the outer periphery of the SAN) does not relate to a host or a storage array, but to an HBA. Thus, monitoring the SAN can be a headache -- with multiple HBAs and multiple paths through the SAN from host to storage array.

I recently saw a potential performance problem at a client site where multiple Solaris hosts accessed a single EMC Symmetrix (via a series of Brocade switches). It was suspected that the activities of one or more of the hosts were adversely affecting the others, including an important production system. A more permanent monitoring solution was planned, but unlikely to be implemented within a month. Thus, this project was started, and it provided visibility of the SAN within a few hours of its inception.

I chose Perl and PHP to implement this solution -- Perl because of its ability to handle complex data structures, and PHP for its easy integration into the Apache Web server. With hindsight, it might have been better to stick to one language, but this project was developed in a hurry.

Introducing MRTG

MRTG (Multi-Router Traffic Generator) is a data gathering and charting tool, written by Tobias Oetiker. As its name indicates, it was originally developed to monitor traditional LAN and WAN devices.

When I googled on the MRTG Web site, it became clear that MRTG could do almost everything that we required. Given the correct SNMP configuration for the switches, it can query the Fibre Channel statistics and graph each port on the switch. MRTG also allows mathematical operations on the SNMP-gathered data. This is a key enabler for this project, as it allows for multiple ports (potentially on different switches) to be aggregated in order to compute the overall throughput for any given host into the SAN.

The only problem was that it was clear that as the number of switches (and hence ports) monitored by the toolkit increased, the MRTG configuration file would rapidly become unmanageable. Furthermore, it was clear that although MRTG does a great job of generating all the necessary graphs and HTML to our specification, we needed several layers of abstraction to rapidly "drill down" into the data when troubleshooting.

This data should all be easily accessible by a browser, and so it became clear that the visualization layer would have to sit between the Apache server and the raw data to help the user interpret the results. To solve these challenges, two programs were written -- the configurator, and the visualizer.

The monitoring host was to be an existing Sun E450, running Solaris 9, with the Apache Web server and Perl (as bundled by Sun). The latest version of PHP and MRTG were then installed (along with the required libraries, most of which are usefully available pre-packaged at (http://www.sunfreeware.com).

Oetiker has included some excellent documentation on the MRTG Web site, so I'll not discuss the details of installing MRTG here. Once you've obtained the required libraries, it is an easy install.

To assist with the example provided in this article, Figure 1 shows a simplified diagram of a fictitious SAN setup and the port assignments are as follows:

switchA1 port assignments:
Port 7    symmA, FA#0
Port 1    hostA1, HBA#0
Port 4    hostA2, HBA#0
switchA2 port assignments:
Port 7    symmA, FA#1
Port 1    hostA1, HBA#1
Port 12    hostA2, HBA#1
Port 15    switchB1, port 15
switchB1 port assignments:
Port 3    symmB, FA#0
Port 1    hostB1, HBA#0
Port 9    hostB2, HBA#0
Port 15    switchA1, port 15
switchB1 port assignments:
Port 3    symmB, FA#1
Port 1    hostB1, HBA#1
Port 9    hostB2, HBA#1
Configuration

The first part of the exercise is to translate the actual switch configuration into the data structure at the head of the configurator program. This has already been done in Listing 1 for the example setup.

The raison d'être for this program is to allow a SAN configuration to be specified in a compact human-readable format. This ensures that it can be updated easily. The configurator will then take the SAN configuration and generate suitable MRTG configuration files.

The first part of the program is clearly the data structure that specifies the SAN. This data structure could be described as "switch-centric". It is very easy to accurately populate the initial values, as the structure can be compared directly with the output of the Brocade "switchShow" command, which will display which ports of the Fibre Channel switch are in use (of course, the Brocade command will show the World-Wide Numbers of the connected devices, which may require translation).

The data structure could easily be hived off to a separate file, or even integrated with a Web front-end to make updates even easier; however, for the sake of simplicity, it has been integrated with the Perl program. The remainder of the program consists of several "foreach" loops that will first invert the data structure so that that it can be later interpreted as "host-centric", and per-host aggregate formulae constructed.

When executed, the program processes the configuration data structure and correlates each HBA with its respective host. It then produces the MRTG configuration files (one per group). Listing 2 shows an example output from the configurator tool. For brevity, I have included only the first three targets of the "sitea.cfg" file.

Note that if a "global.inc" file exists in the configuration directory at the time that the configurator is executed, it will be included into the group configuration files via an MRTG "Include" directive. This allows us to have a file that can specify options common to all groups. An example of this is given in Listing 3. Also note the use of the AddHead directive in the "global.inc" file to include a CSS stylesheet to ensure that all pages have the same look as the rest of the Web site.

Once the configuration files have been generated, MRTG can be started according to the instructions on the MRTG Web site. For this, I chose to run it from cron every five minutes (preferably by a non-privileged user). Listing 4 shows an example crontab extract.

Visualization

By this point, you should have MRTG up and running. It should be generating HTML pages with PNG graphs, which is very useful but very difficult to navigate. The visualizer is a short piece of PHP code that will scan the group directories and present thumbnails in a logical and structured manner. Graphs are presented of the aggregate graphs only -- meaning that the clutter of the individual HBA graphs will be masked from the user.

When the visualizer presents a page of thumbnails, it links each thumbnail back to the MRTG-generated HTML page for the host's aggregate graphs. The user can then expand the selection simply by clicking on the thumbnail graph to obtain a better view of what is going on with that machine. Figure 2 shows an example of the group view, showing all hosts and storage devices. Note that switchb1 and switcha1 are listed -- these graphs represent the inter-switch links.

The visualizer code should be installed in your $HTMLDIR, as "index.php", and the Apache Web server configured with the directive "DirectoryIndex index.php" in order that it is picked up automatically when the browser requests a directory index.

When the user follows the graph thumbnail from the group page, the browser is redirected to the MRTG-produced HTML page for that host's aggregate statistics (see Figure 3). This shows detailed aggregate graphs, followed by thumbnails of each component HBA. Each HBA thumbnail can similarly be expanded by clicking on the graph thumbnail to give a detailed report of the activities of that individual adapter. See Figure 4 for an example.

Conclusion

Enterprise monitoring packages such as BMC Patrol and TeamQuest are very good at what they do, but occasionally we need to look at very specific sets of data for a short period of time while performance troubleshooting. Often, the easiest way to achieve this is to have a package such as MRTG available that can quickly be deployed for an ad hoc request.

In this example, MRTG allowed us to very quickly visualize the entire SAN. We identified key performance facts of which we were not previously aware:

  • Relative load on each individual Symmetrix from each server
  • Correlation of SAN activity against specific events (e.g., significant data loading operations that were causing concern for contention on the Symmetrix arrays)
  • Identification of times of significant activity, allowing us to feedback to the application owners with time-based data in order to reschedule I/O intensive applications to a quieter time

Often we want to monitor critical servers. In this case, it may be difficult to justify significant changes that could impact production. Monitoring the SAN from the switches via SNMP is perceived as a very non-intrusive method (and certainly easier than deploying software agents to all SAN-connected hosts). This solution would likely be approved by even the strictest change-management policies.

MRTG is an extremely flexible tool that enabled us to rapidly begin monitoring the SAN environment to analyze a specific problem. It is still being used today at the client site and is considered a valuable tool in performance monitoring. Tobias Oetiker and Dave Rand have done a tremendous job in developing this package, and it certainly deserves sys admins' consideration.

The code presented in this article is an example of a rapidly developed application and could certainly be improved upon. It is, however, a good example of what can be achieved by extending an existing generic application with tools such as PHP and Perl to suit a very specific set of requirements. An archive of these scripts, including more detail on the install instructions than was possible to include in this article, is available at: http://hindsight.it/san/.

References

Beauchamp, Chris and Josh Judd. Building SANs with Brocade, Syngress, 2001.

Clark, Tom. Designing Storage Area Networks, Addison-Wesley, 2000.

Oetiker, Tobias and Dave Rand, Multi Router Traffic Grapher, available at: http://people.ee.ethz.ch/~oetiker/webtools/mrtg/

Mike Scott is the director of Hindsight IT Ltd, a small Solaris consultancy based in Central Scotland. He has been working in the North East and the central belt for the past ten years, specializing in systems management with a keen interest in security and performance management. He can be contacted at: sysadmin@hindsight.it.