NetWorker Savegroup Summarizer -- A Legato NetWorker Reporting Tool

John Stoffel

NetWorker Savegroup Summarizer (NSS) came about when we replaced our old four-drive, 8mm (5-GB capacity!) tape jukebox with a new tape jukebox holding two DLT7000 tape drives, driven by a dual-CPU Sun Ultra 2 server. At the time, we were running Legato's NetWorker Backup software version 4.2.x to back up about 100 clients of mixed types, mostly UNIX with some PC and NetWare clients. I wanted to get a better idea of exactly how much data we were backing up, how long backups took to complete, and where the performance bottlenecks were. I also wanted to see how much real improvement we were getting in our backup performance after the upgrade to the new hardware.

Legato NetWorker is a powerful tool for managing backups, but it has some frustrating limitations when it comes to reporting useful statistics from a completed backup. The 5.x version was especially limited in the reports you could get from the base product; see Example 1 for a sample savegroup notification report. When I first started this program, the NetWorker Management Console product was not available, and the GEMS reporting tool was an expensive add-on. (See the sidebar for an overview of Legato.)

After looking around and asking on the NetWorker mailing list (see Resources), I ended up writing my own code as presented here.

The goals of the code were as follows:

1. Email the administrator a concise daily report detailing the overall status of the backups completed each night.

2. Report statistics on how much data was backed up and how long it took.

3. Report on the largest saveset backup(s) in terms of the amount of data, as well as the saveset backup(s) that took the most time, since these are not always the same.

4. Keep the report in plain ASCII format, 80 columns maximum width, so that reports are useful even on dumb terminal displays.

See Example 2 for a sample report generated by NSS.

Problem Description

The initial implementation of NSS simply parsed the needed information out of the raw /nsr/logs/messages log file. This worked at first: I could look for the lines that signified the start of a savegroup report, then look for the matching end of the report, and finally post-process the data to make it more presentable.

This process broke, however, when we had multiple savegroups running at different times, with different clients, that overlapped in their start/end times. Since the messages file was just an ever-growing file, it was impossible to tell which savegroup a line of output belonged to without knowing lots of internal details of each savegroup. Because I didn't want to query NetWorker's internal database, and because all the information I needed was in the standard savegroup notification, I wrote the savegroup notifications to a file and ran them through NSS by hand. This was automated fairly quickly with a procmail recipe.

Over time, NSS has been expanded with extra features, such as a tape report showing how many tapes in each tape pool are full, partially full, or empty. This is a useful check to ensure that you have enough tapes for the next night's run. The limitation here is that the tape report must call the mminfo command, which may or may not be available on the system running the report. Another useful feature is the ability to save easily parsed summaries of the backups for each savegroup, recording details such as:

  • Start and end time
  • Time taken to save
  • Number of clients
  • Number of failed clients
  • Size of the backup in kilobytes
  • Backup level

The server and savegroup names are implicit in the default directory structure and filenames used. See the online usage of NSS for details, and Example 3 for a sample log file.

The eventual goal is to write a plotting program (nss-plot; see Resources) to graph the output in a nice overview format. This project, however, has been moving very slowly due to time constraints and the limitations of the various plotting tools.

Breakdown of NSS and How It Works

The first step of the script is parsing the various command-line options. All settings and actions can be changed from command-line switches, so you can embed the script inside another script if need be.
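
NSS uses the standard Getopt::Long module for this. A minimal sketch of that style of option handling follows; the option letters mirror those discussed later in this article, but the default values shown are assumptions, not NSS's actual defaults:

  #!/usr/bin/perl -w
  use strict;
  use Getopt::Long;

  # -L and -l (and so on) must stay distinct single-letter options.
  Getopt::Long::Configure('bundling', 'no_ignore_case');

  # Defaults; the real NSS lets you override all of these.
  my %opt = (
      logdir  => '/nsr/logs/nss',        # assumed default
      savedir => '/nsr/logs/saveinput',  # assumed default
      size_n  => 5,
      time_n  => 5,
      mailto  => '',                     # empty means print to STDOUT
  );

  GetOptions(
      'L=s' => \$opt{logdir},
      'O=s' => \$opt{savedir},
      's:i' => \$opt{size_n},    # optional number, as with NSS's -s
      't:i' => \$opt{time_n},
      'm=s' => \$opt{mailto},
  ) or die "usage: $0 [-L dir] [-O dir] [-s N] [-t N] [-m addr]\n";

  # A bare -s or -t arrives as 0; fall back to the default of 5.
  $opt{$_} ||= 5 for qw(size_n time_n);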

State Machines

The core stage processes STDIN, scanning the input for the start and end markers that delimit the savegroup report. This is really the heart of the code, and it can be thought of as a simple finite state machine. Any good book on computer theory and programming will describe state machines in more detail, but one definition is:

  • An initial state
  • A set of possible inputs
  • A set of new states that may result from the input
  • A set of possible actions or output events that result from a new state

This is a very useful way of thinking when you are trying to write a program (which is itself just a fancy state machine) to parse a set of data and turn it into a more useful representation. Humans are very good at pattern matching; computers are not, or more accurately, they are poor at exception handling in pattern matching, as anyone discovers when trying to write one regular expression that handles all the myriad possibilities and still pulls out that one piece of data.

This is where the power of Perl comes in, since we need regular expressions both to find the markers that determine which state we are in and to pull out the needed data inside each section. Luckily, Legato has kept the format of the savegroup reports quite static over three major versions of the software.

Another useful feature of Perl 5.x is its support for complex data structures, which lets us use hashes of hashes to store the various bits of information pulled from the savegroup report. This article cannot cover these in any great depth, so I recommend the Perl documentation, especially the perldsc and perlreftut man pages, or the O'Reilly & Associates books Programming Perl, 3rd Edition, by Larry Wall, Tom Christiansen, and Jon Orwant, and Advanced Perl Programming by Sriram Srinivasan.

Breaking the problem into smaller steps is just one technique for making it more manageable. In general, when scanning an arbitrary stream of input looking for data, the following states are possible:

  • I have not found the data I need.
  • I have found the data I need.
  • I have found a different set of data I need.
  • I am at the end of the data I need.
  • No more data to read.

Transitions are how you move from one state to another; they are really just how you tell the computer what to do with the next piece of input.

When parsing input, you can choose the chunk size: characters, words, lines, paragraphs, etc. It can be tricky to determine which state to be in (i.e., what to do with the input) when a match spans multiple chunks. In such situations, it makes sense to break the span into multiple states: when you find Chunk A, you know to look for Chunk B; if that fails, you reset to the state you were in before you found Chunk A, or simply fall back to a default state.

For example, say you are reading input one line at a time, but the marker you need to match is split across two lines, with the pieces being "foo" and "bar", and the end marker being "END" or EOF. Some basic Perl code to handle this can be seen in Figure 3. Note that while we read the input, the variable $state keeps track of the state we are in, which tells us what to expect next. This lets us skip the processing for states that don't apply, or that might cause problems if we couldn't tell what to do with the input without knowing where we are in the processing. Also notice how we jump out of the various "if-then" constructs and back to the top of the loop to read another line of input.
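
A minimal, self-contained sketch of the same idea (not Figure 3 itself) looks like this:

  use strict;
  use warnings;

  my $state = 'want_foo';                # the initial state

  while (my $line = <STDIN>) {
      chomp $line;
      last if $line eq 'END';            # explicit end marker; EOF ends the loop too

      if ($state eq 'want_foo') {
          $state = 'want_bar' if $line =~ /foo/;
          next;                          # back to the top for the next line
      }
      if ($state eq 'want_bar') {
          if ($line =~ /bar/) {
              print "found the foo/bar pair\n";   # the action for this state
          }
          $state = 'want_foo';           # reset, whether we matched or not
          next;
      }
  }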

Parsing Input in Perl

Using this type of programming, we can process almost any type of input. When parsing a NetWorker savegroup report, there are four possible states the input can be in:

1. We haven't found the start of the savegroup report.

2. We've found the header and parsed the info from it.

3. We're done with the header but now we have "Unsuccessful Save Sets" to read in and process.

4. We've found the "Successful Save Sets" marker and we're processing them.

In state one, we are looking for the marker that will drop us into state two, where we will process the actual savegroup. This lets us skip mail headers or other extraneous information.

In state two, we have matched the regular expression that tells us we have found the start of the savegroup report. The code currently supports both Legato NetWorker and "Solstice Backup", which is Sun's name for their OEM'd version from Legato.

In this state, we pull out information on the number of clients, the savegroup name, the start (or restart) time, and the end time. We might also find out the name of the backup server if we're lucky. This is one of the more frustrating limitations of the savegroup report format, because it does not explicitly state the name of the backup server.

So, state three helps figure out the name of the backup server. Because we know that all indexes are saved on the server, we can look for lines that mention an index save and pull the server name from there. State three may or may not be used, depending on the savegroup report and whether there were any failures.
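
The hunt boils down to one regular expression. Here is a sketch, with the caveat that the line format shown in the comment is an approximation rather than a verbatim NetWorker line:

  my $server;
  while (my $line = <STDIN>) {
      # A line mentioning an index save looks roughly like:
      #   server1: index:client1   level=full,  3 MB 00:00:21   42 files
      # (an approximation; the exact format varies by NetWorker version)
      if (!defined $server && $line =~ /^([\w.-]+):\s+index\b/) {
          $server = $1;    # indexes are always saved on the server itself
      }
  }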

The main work is handled in the fourth state, which is where we process the individual client saveset (think filesystems) reports. Again, we try to determine which host is really the server. Note that when the server is finished writing all of a client's savesets to tape, it will write the client index on the server to tape as well, so we also look for that information. As each client's information is read in, we find: client name, saveset (directory), level, total amount of data written, the scale used for this measurement (e.g., kilo-, mega-, giga- or terabytes), the time it took to write the data, and the number of files saved.

There is no finish state. I assume that the input will continue to match in the fourth state; any input that doesn't match the NetWorker report format is simply skipped, so processing continues until EOF.

One key step during the processing of client saveset reports in state four is to take the amount of data saved and scale it into kilobytes, so that everything is consistent. This simplifies the later processing and printing of reports. (Note that I use the standard 1024 bytes in a kilobyte, not the marketing-driven version that uses 1000.)
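
The conversion is a simple table lookup. A sketch, using the 1024-based multipliers just described (the unit names are illustrative):

  # Multipliers into kilobytes, using 1024, not 1000.
  my %to_kb = (
      B  => 1 / 1024,
      KB => 1,
      MB => 1024,
      GB => 1024 ** 2,
      TB => 1024 ** 3,
  );

  sub size_in_kb {
      my ($amount, $unit) = @_;          # e.g. (1.5, 'GB')
      die "unknown unit: $unit\n" unless exists $to_kb{uc $unit};
      return $amount * $to_kb{uc $unit};
  }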

Perl's ability to handle complex data structures is also a key element here, because it lets us store the parsed data in a hash of hashes of hashes. This could have been done using split() and join() to build long keys, or using multiple parallel hashes, each holding just one piece of info, but a complex data structure keeps all the data together in one place. The basic data structure is as follows:

Client --> Saveset Name --> Level
                        --> Total Data Written
                        --> Time to Write Data
                        --> Number of Files Written

Each level of the above structure is a hash, pointing to one or more sub-hashes as needed. The first level is the name of the client, and since each client can have multiple savesets, that leads to the second level of the hash. At the third level, we could have used a fixed array to hold the information, but continuing with hashes serves two purposes. One, it's self-documenting: there's no need to remember that index 2 is the number of files written by the client in that particular saveset. Two, the sorting and report-generation functions are simpler and more consistent, since the entire data structure is just hashes.
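
In Perl, storing and walking that structure looks something like this (the field names are illustrative, not necessarily those used in NSS):

  my %backup;    # client => saveset => field => value

  # Storing one parsed saveset record:
  $backup{client1}{'/export/home'} = {
      level => 'full',
      kb    => 1_234_567,    # already scaled to kilobytes
      secs  => 3_600,
      files => 45_678,
  };

  # Reading it back while generating a report:
  for my $client (sort keys %backup) {
      for my $ss (sort keys %{ $backup{$client} }) {
          printf "%-12s %-20s %10d KB\n",
                 $client, $ss, $backup{$client}{$ss}{kb};
      }
  }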

Post-Processing and Reporting

After all the data has been parsed, we post-process it to pull out the start and end times, as well as to sum the total amount of data written to tape across all savesets from all clients. This is a simple step: since all the sizes are already in kilobytes, we just sum them, both by level and overall.
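
Over the structure sketched above, the summation is a pair of nested loops:

  my $grand_total = 0;
  my %total_by_level;

  for my $client (keys %backup) {
      for my $ss (keys %{ $backup{$client} }) {
          my $rec = $backup{$client}{$ss};
          $grand_total                     += $rec->{kb};
          $total_by_level{ $rec->{level} } += $rec->{kb};
      }
  }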

Once this is done, we must determine which scale to use when displaying the reports. Generally, I use the biggest scale possible; if we have written gigabytes of data to tape, it's not very informative to see the number in megabytes.
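
Picking the scale amounts to dividing by 1024 until the number is small enough to read comfortably. A sketch:

  sub human_size {
      my ($kb) = @_;                     # size in kilobytes
      my @units = qw(KB MB GB TB);
      my $i = 0;
      while ($kb >= 1024 && $i < $#units) {
          $kb /= 1024;
          $i++;
      }
      return sprintf '%.1f %s', $kb, $units[$i];
  }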

After that, we can pick and choose among the various reports and output them. When printing reports in Perl, most people reach for formats as a quick-and-dirty option, but formats have some limitations I found frustrating, mostly their compaction of empty lines. So, I use formats only for the main summary section, and printf() for the various extra reports.

These other reports are fairly self-explanatory, but I will look at how a couple of them work. The print_top_n_size() function sorts and displays the clients that wrote the most data. It works in two steps: the first goes through all the clients, totals the size of all savesets written by each client, and puts the results into a temporary hash. The second prints the report header and loops through the temporary hash of client totals, printing until we either run out of clients or reach the maximum number of clients to show. Generally, only the top five or ten clients are interesting in terms of the amount of data written.
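
A sketch of those two steps, reusing the %backup structure and human_size() helper shown earlier (the real print_top_n_size() differs in detail):

  sub print_top_n_size {
      my ($n) = @_;

      # Step 1: total all savesets per client into a temporary hash.
      my %client_total;
      for my $client (keys %backup) {
          for my $ss (keys %{ $backup{$client} }) {
              $client_total{$client} += $backup{$client}{$ss}{kb};
          }
      }

      # Step 2: print the header, then the clients, biggest first.
      print "Top $n clients by data written:\n";
      my @clients = sort { $client_total{$b} <=> $client_total{$a} }
                    keys %client_total;
      splice @clients, $n if @clients > $n;
      printf "  %-20s %12s\n", $_, human_size($client_total{$_})
          for @clients;
  }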

In contrast, the report of the top N hosts by time is broken down by saveset, since a client with a very small amount of data could have a problem and take a very long time to write that data; this is interesting for troubleshooting purposes. The general structure of the report is the same, though: a first pass through the data builds a temporary hash of the needed info, then the header is printed and the report is written.

One possible addition would be to include the size of the data written as well as the time, but I haven't felt a need for this, and no one has requested it. There is also an issue of report width, since I am trying to keep the entire report under 80 columns if at all possible.

Setting Up and Using NSS

To run these scripts, you need a reasonably up-to-date version of Perl (see Resources) and the Time::Local and Getopt::Long modules, both of which come standard with Perl 5.000 and newer.

Edit the file to make sure that you have the correct path to your locally installed version of Perl. You can also edit the first few lines that specify the default directories for where the logs and the savegroup input should be saved. These can also be specified on the command line with the -L and -O options, respectively.

You can feed the raw Savegroup summaries directly to NSS via STDIN to get a nice report, as shown in Example 2. This is very useful for testing or just running off a quick report to make sure things are working correctly for your site.

Another technique is to use procmail to filter incoming savegroup notification emails and forward the summaries, while saving the raw savegroup notification to a mail folder. This is slightly trickier, but it eliminates the worry that you'll lose the raw notifications sent to your sys admins. See Figure 2 for an example procmail recipe; you will probably have to tweak it to recognize the format of email sent to you. In that example, the email is sent from SERVER, and it is saved in a mail folder in the user's Mail/ directory. For a more complete discussion of procmail and how to write recipes, see the Resources section.
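
A recipe in roughly this shape does the job. Treat it as a sketch, since the matching headers, folder name, and path to nss all depend on your site:

  MAILDIR=$HOME/Mail

  # Pipe a copy of each savegroup notification through NSS...
  :0 c
  * ^From .*SERVER
  * ^Subject:.*[Ss]avegroup
  | /path/to/nss -s 10 -t 10 -m "admins@foo.com"

  # ...and file the raw notification in a mail folder as well.
  :0:
  * ^From .*SERVER
  * ^Subject:.*[Ss]avegroup
  savegroup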

You can call NSS directly from NetWorker so that only summaries get emailed to the sys admins. Here are instructions for Legato NetWorker 5.1.1 on Solaris. They might be slightly different depending on which version of NetWorker and which OS you are running.

1. Start up the /usr/bin/nsr/nwadmin GUI and connect to your NetWorker server.

2. Click on Customize -> Notifications.

3. A new window will pop up with a list of notifications. Scroll down and highlight "Savegroup completion".

4. Edit the action to be as follows:

   /path/to/nss -o -l -s 10 -t 10 -T -m "admins@foo.com"

5. Click on the "Apply" button.

The above options deserve some explanation, as shown in Figure 1. Note, however, that you can see the online help, with an explanation of all the arguments, by passing the -h flag to nss from the command line.

The -o option tells NSS to write the savegroup notification to the default saveinput directory, as specified in the source code or by the -O option. The filename format option, -F, defaults to "%S/%G-%D", where %S is the backup server name, %G is the savegroup name, and %D is the date of the savegroup notification. This lets you log the data from multiple backup servers, each with multiple savegroups, into a central and consistent directory structure. If necessary, you can recreate reports by running NSS on the saved file(s).
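
The expansion itself is just string substitution. An illustrative sketch (not NSS's actual code, and the date format shown is an assumption):

  use POSIX qw(strftime);

  sub expand_name {
      my ($fmt, $server, $group) = @_;
      my %sub = (
          S => $server,
          G => $group,
          D => strftime('%Y-%m-%d', localtime),   # assumed date format
      );
      $fmt =~ s/%([SGD])/$sub{$1}/g;
      return $fmt;
  }

  # expand_name('%S/%G-%D', 'backuphost', 'nightly')
  #     returns something like 'backuphost/nightly-2003-04-01'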

The -l option is similar, but it turns on logging of the summary data, to be plotted later with some sort of data analysis tool such as plot-nss. Its companion option, -L, specifies the directory to log the data to.

The -s and -t options can be used with or without optional numbers. They turn on reporting of the top N (default 5) savesets by size and by time, respectively.

The -T option turns on the tape report. Note that this option can only show the status of tape use at the time the report is run, so if you run it immediately after NetWorker sends the savegroup notification, you will get a reasonably accurate picture of the tapes remaining. Note also that the report uses the mminfo command to extract information from NetWorker, so it depends on the permissions of the user running it; such access might not be granted to general users.
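
For the curious, the tape report boils down to a shell-out along these lines. The report fields shown are an assumption; check the mminfo documentation for your NetWorker release:

  # Ask mminfo for each volume's pool and how full it is.
  my @lines = `mminfo -a -r 'volume,pool,%used' 2>/dev/null`;
  die "mminfo failed: not installed, or insufficient permission?\n"
      unless @lines;

  my %count;     # pool => { full | partial | empty => n }
  shift @lines;  # drop the header line
  for my $line (@lines) {
      my ($vol, $pool, $used) = split ' ', $line;
      next unless defined $used;
      my $bucket;
      if    ($used =~ /full/i)           { $bucket = 'full' }
      elsif ($used =~ /^(\d+)%?$/ && $1) { $bucket = 'partial' }
      else                               { $bucket = 'empty' }
      $count{$pool}{$bucket}++;
  }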

The -m option is obvious: it is the email address to which to send the Savegroup Summary. If it is left off, the default of STDOUT is used. Its companion option, -S <string>, specifies the subject of the emailed Savegroup Summary. It defaults to "Backups %E: %S - %G", where %S and %G are the same as in the -F option, and %E gives the status of the completed backup, either "SUCCEEDED" or "FAILURES". The idea is that if you miss reading email for several days, you can sort your inbox by "Subject:" and dispose of all the successful reports quickly, focusing on the failures, which are the most important.

Plans for future work include dynamically sizing report column widths based on client hostname length, a plotting/visualization tool for the data saved with the -l option, handling newer versions of NetWorker, and adding support for Veritas NetBackup.

Conclusion

NSS is still under development as time and energy permit, but the basic layout has stabilized over the past year because it does what I need it to do without muss or fuss.

In this article, I've tried to provide both a useful script and some pointers on the concepts you can use to write your own application for parsing arbitrary input and generating useful reports.

Resources

NSS Homepage -- http://jfs.ecotarium.org/sources/nss

Procmail -- http://www.procmail.org

Legato -- http://www.legato.org

NetWorker Users Mailing List -- http://listmail.temple.edu/archives/networker.html

John Stoffel attended Worcester Polytechnic Institute where he earned a degree in computer science and spent way too much time doing Theatre and Rock'n'Roll lighting on the side. He currently works as a senior UNIX sys admin for a not-so-large-anymore major telecommunications company. He is also a Board member of the USENIX SAGE Certification program at http://www.sage-cert.org. He can be reached at: john@stoffel.org.