Cover V14, i01
jan2005.tar

Using the R System for Systems Administration

Mihalis Tsoukalos

This article is about R, which is an advanced statistical package with many complex capabilities. However, don't be afraid of R if you aren't very comfortable with mathematics and statistics. This article will cover some simple, useful capabilities of the package tailored for systems administrators.

R is a GNU project based on S, a statistics-specific language and environment developed at AT&T Bell Labs. R is an interpreted computer language. The R system distribution supports many statistical procedures including linear and generalized linear models, nonlinear regression models, time series analysis, classical parametric and nonparametric tests, clustering, and smoothing. The current version of R is 2.0.0, which was released on October 4, 2004. For more information, visit the R Project home page (http://www.r-project.org). There is also a commercial implementation of S, called S-PLUS (http://www.insightful.com/), which has more facilities and capabilities than R. The examples presented in this article can also run in S-PLUS with little or no modification.

Running the R System

R runs on Unix/Linux variants, on Windows, and on Mac OS X Panther. There are GUIs for R, but all you need for the purposes of this article is the command-line version. The examples in this article were written using R on Mac OS X Panther and Debian Linux.

To run R, type R (assuming that the R binary is in your PATH), which will show something like the following:

racoon:~/code/R $ R

R : Copyright 2004, The R Foundation for Statistical Computing
Version 1.9.0  (2004-04-12), ISBN 3-900051-00-3

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

 >
To quit R, just type q() at the prompt.

Basic Commands of the R System

First, the commands for inserting, naming, and selecting data are presented. The following example creates a data set (actually a vector) called SYSADMIN. This data set contains the 0th to 6th powers of the number 3. To view the data in an existing data set, just type its name at the R prompt:

 > SYSADMIN <- 3^(0:6)
 > SYSADMIN
[1]   1   3   9  27  81 243 729
 >
The notation 0:6 returns a sequence from 0 to 6, including 0 and 6, which is a total of seven numbers. The names() command lets you give each element of a vector a name, so that you can later access the element by that name. In this example, the numbers 0 to 6 are used as names:

 > names(SYSADMIN) <- 0:6
 > SYSADMIN
   0   1   2   3   4   5   6
   1   3   9  27  81 243 729
 > class(SYSADMIN)
[1] "numeric"
 >
The following lines show the advantages of calling a vector by name:

 > SYSADMIN[0]
named numeric(0)
 > SYSADMIN["0"]
0
1
 > SYSADMIN[1]
0
1
 >
You can now retrieve the 0th power of 3 with SYSADMIN["0"] instead of SYSADMIN[1] (the 0th power is the first element of the vector), which is less descriptive. Note that the first command returns an empty vector: R indexes vectors from 1, so there is no 0th element.
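The relationship between names and positions can be checked with a short sketch. POWERS is a hypothetical name chosen for this illustration, so that the SYSADMIN vector is left untouched:

```r
# POWERS is a hypothetical stand-in for the SYSADMIN vector.
POWERS <- 3^(0:6)
names(POWERS) <- 0:6          # the numbers are stored as the strings "0".."6"

# Look up the same element by name (a string) and by position (a number).
by_name     <- POWERS["0"]
by_position <- POWERS[1]      # R indexes from 1
same_element <- identical(by_name, by_position)

# unname() drops the label when only the value is needed.
fourth_power <- unname(POWERS["4"])

# Index 0 selects nothing, so the result has length zero.
empty_length <- length(POWERS[0])

print(same_element)
print(fourth_power)
print(empty_length)
```

Quoting the index is what matters here: "4" is looked up among the names, while a bare 4 would be treated as a position.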

If you want to remove the names you gave to the elements of the vector with the names(SYSADMIN) command, you can use the following command:

names(SYSADMIN) <- NULL
To insert data into an existing data set, assign a value to a new name, as in the following example:

> SYSADMIN["7"] <- 3^7
> SYSADMIN
   0    1    2    3    4    5    6    7
   1    3    9   27   81  243  729 2187
>
The following command illustrates how to select specific ranges of data from the SYSADMIN vector using indices:

> SYSADMIN[3:4]
2 3
9 27
>
To delete a data set, use the rm() command. To delete the SYSADMIN data set, type rm(SYSADMIN) at the R prompt. To list the available objects, use the objects() command (or its synonym, ls()).
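As a quick illustration, the following sketch creates a throwaway object, confirms that objects() lists it, and then removes it. SCRATCH is a hypothetical name used only here:

```r
# SCRATCH is a hypothetical object created only for this illustration.
SCRATCH <- 1:10

# objects() returns the names of the available objects as strings.
listed_before <- "SCRATCH" %in% objects()

# rm() deletes the object; it no longer appears in the listing.
rm(SCRATCH)
listed_after <- "SCRATCH" %in% objects()

print(listed_before)
print(listed_after)
```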

The summary() command (the output will be explained in more detail later) gives useful information about the data set. For this example, the data set is called CALAMARIS. Data is taken from a Calamaris report file. Calamaris is a program for analyzing proxy server log files (a Squid log file was used here). Table 1 shows the data and explains the meaning of each column. The table was saved in a file called CALAMARIS.data and loaded into R with the following command:

> CALAMARIS <- read.table("/Users/mtsouk/docs/article/R.SysAdmin/examples/CALAMARIS.data", header=TRUE)
> summary(CALAMARIS)
  domain   number.of.requests  percent.of.total.requests  Total.Bytes
*.ca :1    Min. : 3.0          Min. : 0.09                Min. : 7919
*.com :1   1st Qu.: 32.0       1st Qu.: 0.94              1st Qu.: 114303
*.de :1    Median : 127.0      Median : 3.75              Median : 450469
*.edu :1   Mean : 376.8        Mean :11.11                Mean :1656217
*.gr :1    3rd Qu.: 187.0      3rd Qu.: 5.51              3rd Qu.:1249501
*.net :1   Max. :1403.0        Max. :41.37                Max. :7649799
(Other):3
> 
There is also a very handy way to represent a data set graphically. Figure 1 shows the output of the pairs() command, again applied to the CALAMARIS data set. What you see in Figure 1 is a grid of scatter plots in which every column of the data set is plotted against every other column.
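If you want to try pairs() without a Calamaris report at hand, a minimal sketch along the following lines should do. The PROXY data frame and its values are made up for illustration, and the plot is written to a temporary PDF file (an assumption made here so that the example also works in a non-interactive session):

```r
# PROXY is invented sample data, not the CALAMARIS file from the article.
PROXY <- data.frame(
  requests = c(12, 45, 130, 300, 900),
  kbytes   = c(40, 150, 500, 1100, 4000)
)
# Derive a third column so that there are three pairs to plot.
PROXY$percent <- round(100 * PROXY$requests / sum(PROXY$requests), 2)

# Send the scatter plot matrix to a PDF file in the temporary directory.
out <- file.path(tempdir(), "proxy-pairs.pdf")
pdf(out)
pairs(PROXY)
invisible(dev.off())

print(out)
```

In an interactive session you can skip the pdf() and dev.off() calls and simply type pairs(PROXY) to see the plot on screen.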

R supports the following types of objects:

  • Vectors (the most important objects in R)
  • Matrices (arrays)
  • Factors
  • Lists
  • Data frames
  • Functions

For more information about those objects, refer to the documentation that comes with your R installation.
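To get a feeling for these object types, the following sketch creates one small object of each kind and asks R for its class. All names and values are hypothetical:

```r
# One small, made-up object of each type listed above.
v <- c(1, 3, 9)                                   # vector
m <- matrix(1:6, nrow = 2)                        # matrix
f <- factor(c("gr", "com", "gr"))                 # factor
l <- list(host = "Pluto", cpus = 2)               # list
d <- data.frame(host = c("Pluto", "Plato"),
                cpus = c(2, 4))                   # data frame
g <- function(x) x * 2                            # function

# class() may return several values, so keep only the first one.
kinds <- sapply(list(v, m, f, l, d, g), function(obj) class(obj)[1])
print(kinds)
```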

Advanced Commands of the R System

The save() command is used for dumping an object to disk in order to use it later:

 > save(SYSADMIN, file = "/Users/mtsouk/SYSAMIN.r")
To restore a saved object from a file, use the load() command:

 > rm(SYSADMIN)
 > SYSADMIN
Error: Object "SYSADMIN" not found
 > load( file = "/Users/mtsouk/SYSAMIN.r" )
 > SYSADMIN
    0    1    2    3    4    5    6    7
    1    3    9   27   81  243  729 2187
 >
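The same round trip can be tried without touching your home directory. This is a minimal sketch that assumes a temporary file is acceptable; POWERS and the file name are hypothetical:

```r
# POWERS and the temporary path are chosen only for this illustration.
POWERS <- 3^(0:7)
path <- tempfile(fileext = ".RData")

save(POWERS, file = path)   # dump the object to disk
rm(POWERS)                  # the object is now gone from the session

# load() recreates the object under its original name and
# returns the names of everything it restored.
restored <- load(path)

print(restored)
print(POWERS)
```

Note that load() does not return the data itself; it puts the object back under the name it had when it was saved.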
The edit() command opens a data set in an editor, ready for editing, which I find very practical. R can also import data from various formats and database systems, including PostgreSQL and any data source that supports the ODBC interface. R can also communicate via BSD sockets. For more information, refer to:

http://developer.r-project.org/db
The merge() command can be very useful because it works similarly to database joins, which means that related tables of data can be combined into one table. The following is a complete example of merge():

   > SERVER
           Name             OS        Version
   1      Pluto        Solaris              8
   2      Plato   Linux_Debian         Stable
   3     Racoon            AIX             5L
   4        Pik   Linux_Debian       Unstable
   5    Eugenia    Solaris_x86              9
   > ADMIN
        Machine     Admin_Name  Admin_Surname
   1      Pluto            Tom        Philips
   2    Eugenia           Anna          Tomas
   3      Plato            Jim   Papadopoulos
   4     Racoon          Peter          McRay
   5        Pik           John          Papas
   > merge(SERVER, ADMIN, by.x="Name", by.y="Machine")
           Name             OS        Version  Admin_Name  Admin_Surname
   1    Eugenia    Solaris_x86              9        Anna          Tomas
   2        Pik   Linux_Debian       Unstable        John          Papas
   3      Plato   Linux_Debian         Stable         Jim   Papadopoulos
   4      Pluto        Solaris              8         Tom        Philips
   5     Racoon            AIX             5L       Peter          McRay
   >
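The listing above does not show how SERVER and ADMIN were created. One way to reproduce the example, assuming the two tables are typed in directly with data.frame(), is sketched below:

```r
# Building the tables with data.frame() is an assumption; the article
# only shows their contents.
SERVER <- data.frame(
  Name    = c("Pluto", "Plato", "Racoon", "Pik", "Eugenia"),
  OS      = c("Solaris", "Linux_Debian", "AIX", "Linux_Debian", "Solaris_x86"),
  Version = c("8", "Stable", "5L", "Unstable", "9"),
  stringsAsFactors = FALSE
)
ADMIN <- data.frame(
  Machine       = c("Pluto", "Eugenia", "Plato", "Racoon", "Pik"),
  Admin_Name    = c("Tom", "Anna", "Jim", "Peter", "John"),
  Admin_Surname = c("Philips", "Tomas", "Papadopoulos", "McRay", "Papas"),
  stringsAsFactors = FALSE
)

# Join the tables on the machine name, which is called Name in SERVER
# and Machine in ADMIN. The result is sorted by the key column.
JOINED <- merge(SERVER, ADMIN, by.x = "Name", by.y = "Machine")
print(JOINED)
```

The by.x and by.y arguments are needed only because the key column has a different name in each table; when the names match, a single by argument is enough.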
   

A Mail Server Application

Log files from a Postfix mail server are used in this simple application. The data of interest in the log files includes the top-level DNS domain (.gr, .com, etc.) of the outgoing mail address, the delay (in seconds), and the time of day (in HH:MM:SS format). To extract the data, grep, sed, and awk were used. (Perl or another scripting language could have been used instead.) The first 10 lines of the data, including the column titles, are shown in Table 2.
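The extraction could also be done inside R itself. The sketch below is not the grep, sed, and awk pipeline used for this article; it assumes delivery lines that carry a syslog timestamp followed by to=<address> and delay=seconds fields, and the three sample lines (hosts, queue IDs, and addresses) are invented:

```r
# Invented sample lines; the exact layout of your log may differ.
log_lines <- c(
  "Jan  5 11:07:12 mail postfix/smtp[2101]: 4F3A21C0: to=<user@example.gr>, relay=mx.example.gr[192.0.2.10], delay=3, status=sent (250 OK)",
  "Jan  5 13:23:47 mail postfix/smtp[2155]: 7B9C02D1: to=<someone@example.com>, relay=mx.example.com[192.0.2.20], delay=12, status=sent (250 OK)",
  "Jan  5 13:24:02 mail postfix/qmgr[812]: 7B9C02D1: removed"
)

# Keep only the lines that describe a delivery attempt.
delivery <- grep("to=<.*delay=", log_lines, value = TRUE)

# Pull out the three fields with regular expressions.
MAILDATA <- data.frame(
  Time   = sub("^[A-Za-z]+ +[0-9]+ +([0-9:]+) .*$", "\\1", delivery),
  Domain = sub("^.*to=<[^>]*\\.([A-Za-z]+)>.*$", "\\1", delivery),
  Delay  = as.numeric(sub("^.*delay=([0-9.]+),.*$", "\\1", delivery))
)
print(MAILDATA)
```

Each sub() call replaces the whole line with the single captured field, so the three columns stay aligned line by line.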

Extracting Information

What information can we get from the data using R? Summary info (using the summary() command) can be extracted, which in this particular case gives:

 > summary(MAILDATA)
        Time     Domain        Delay
  11:07:12:  5   au :  3   Min.   :  1.00
  08:51:05:  3   com: 10   1st Qu.:  2.00
  13:23:47:  3   edu:  2   Median :  3.00
  06:12:53:  2   gr :117   Mean   : 11.38
  16:42:34:  2   org: 11   3rd Qu.:  6.00
  00:52:50:  1   uk :  2   Max.   :217.00
  (Other) :129
 >
This tells us that most of our emails go to the .gr domain and that the busiest moment (busy only in relative terms, because these log files came from my home dial-up server) was 11:07:12. Instead of Time, you can use Day, Week, Month, or even Year variables to summarize mail traffic. The 3rd Qu. value of 6 seconds is close to the Median of 3 seconds, which means that 75% of the outgoing messages were sent with a delay of 6 seconds or less. The Mean (11.38) is larger than the 3rd Qu. only because of a few long delays, such as the Max. of 217 seconds. If you want more precise information, you can divide the data set into smaller data sets.
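Dividing the data set can be done in R without going back to the log files. The following sketch uses a few invented rows in place of the real MAILDATA and shows two options: selecting the rows of one domain, and computing a statistic per domain with tapply():

```r
# Invented rows standing in for the real MAILDATA data set.
MAILDATA <- data.frame(
  Domain = c("gr", "gr", "com", "gr", "org", "com"),
  Delay  = c(2, 3, 40, 1, 6, 10)
)

# Option 1: a smaller data set with the .gr messages only.
GR_ONLY <- MAILDATA[MAILDATA$Domain == "gr", ]
print(summary(GR_ONLY$Delay))

# Option 2: the median delay of every domain in one step.
median_by_domain <- tapply(MAILDATA$Delay, MAILDATA$Domain, median)
print(median_by_domain)
```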

Output Explanation

The Time and Domain data are not numbers, so R counts the occurrences of each distinct value (treating each value as a string) and prints the most frequent ones. As far as Delay (which is numeric) is concerned, R calculates and displays the following six values:

  • Min. -- This is the minimum value of the data set.

  • Median -- This is a value that divides the sorted data set into two subsets (left and right) with the same number of elements. If the data set has an odd number of elements, then the Median is part of the data set. On the other hand, if the data set has an even number of elements, then the Median is the mean value of the two center elements of the data set.

  • 1st Qu. -- The 1st Quartile q1 is a value, not necessarily belonging to the data set, with the property that (at most) 25% of the data set values are smaller than q1 and (at most) 75% of the data set values are bigger than q1. You can think of it roughly as the Median of the left half of the sorted data set. When q1 does not coincide with an element of the data set, it is produced by linear interpolation between the values to the left (v) and to the right (w) of its position in the sorted data set. The weights depend on the number of elements; for example, when the position falls one quarter of the way from v to w:

    q1 = 0.75 * v + 0.25 * w
    
  • Mean -- This is the mean value of the data set (the total sum divided by the number of the items in the data set).

  • 3rd Qu. -- The 3rd Quartile q3 is a value, not necessarily belonging to the data set, with the property that (at most) 75% of the data set values are smaller than q3 and (at most) 25% of the data set values are bigger than q3. You can think of it roughly as the Median of the right half of the sorted data set. When q3 does not coincide with an element of the data set, it is produced by linear interpolation between the values to the left (v) and to the right (w) of its position in the sorted data set. The weights depend on the number of elements; for example, when the position falls three quarters of the way from v to w:

    q3 = 0.25 * v + 0.75 * w
    
  • Max. -- This is the maximum value in the data set.

Please note that many definitions of quartiles exist. If you try another statistical package, you may get different results.
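The six values can be checked by hand on a small data set. The sketch below uses six made-up numbers, a size for which R's default quartile rule (type 7 of the quantile() function) places q1 one quarter of the way between the 2nd and 3rd sorted values and q3 three quarters of the way between the 4th and 5th. It also computes q1 with a different rule (type 6) to show that definitions disagree:

```r
# Six made-up values, already sorted.
x <- c(2, 4, 7, 10, 15, 40)
print(summary(x))

# Quartiles as computed by R's default rule (type 7).
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)

# The same quartiles by hand, interpolating between neighbours v and w.
q1_manual <- 0.75 * x[2] + 0.25 * x[3]
q3_manual <- 0.25 * x[4] + 0.75 * x[5]

# Another of the quartile definitions that quantile() offers.
q1_type6 <- quantile(x, 0.25, type = 6, names = FALSE)

print(c(q1 = q1, q1_manual = q1_manual, q1_type6 = q1_type6))
print(c(q3 = q3, q3_manual = q3_manual))
```

With a different number of elements the interpolation weights change, so the 0.75 and 0.25 factors above should not be read as a general rule.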

The pairs() command (output shown in Figure 2) gives a graphical overview of the data. From this image, and especially from the Time-Delay pair, you can conclude that there are no major delays. You could also automate this procedure and have the results sent to you by email.

By calling attach() with a data set as its argument, you can refer to the columns of the data set directly by name. Thus, after running attach(MAILDATA), you can use the hist(Delay) command to draw a histogram of the delay frequencies and get a more accurate view of the delay times. Executing hist(Delay, xlab="Delay in seconds", ylab="Number of emails", labels=TRUE) produces the plot shown in Figure 3.
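As a sketch of how the histogram step might be automated, the following example writes the plot to a PDF file instead of the screen. It uses with() rather than attach(), which gives the same access to the columns by name but only for the duration of one command; the delay values and the output file are invented:

```r
# Invented delays standing in for the Delay column of MAILDATA.
MAILDATA <- data.frame(Delay = c(1, 2, 2, 3, 3, 3, 6, 11, 45, 217))

# Write the histogram to a PDF file in the temporary directory.
out <- file.path(tempdir(), "delay-hist.pdf")
pdf(out)
h <- with(MAILDATA,
          hist(Delay, xlab = "Delay in seconds",
               ylab = "Number of emails", labels = TRUE))
invisible(dev.off())

# hist() also returns the numbers behind the plot.
print(h$breaks)
print(h$counts)
```

The file produced this way could then be attached to a report or mailed by whatever scheduling tool you already use.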

A Web Server Application

For this example application, the data is taken from a Web server log file that covers one day. Again, the data was extracted using a combination of the sed, awk, and grep utilities.

The first 10 lines of the data, including the column titles, are shown in Table 3.

Note that an underscore was added in front of each status code so that R will not treat the StatusCode value as numeric.
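The effect of the underscore can be seen on a tiny made-up sample. The sketch below first shows that bare status codes are read as numbers, and then shows an alternative to editing the data: asking read.table() to read the columns as text and converting the codes to a factor afterwards. This alternative is a suggestion, not the approach used for the article:

```r
# Four invented log entries with bare (unprefixed) status codes.
raw <- "Time StatusCode
10:46 304
10:46 200
10:58 304
10:58 404"

# Read as-is: the codes become numbers, so summary() would
# report quartiles instead of counts.
AS_READ <- read.table(text = raw, header = TRUE)
print(is.numeric(AS_READ$StatusCode))

# Read both columns as text, then turn the codes into a factor.
AS_TEXT <- read.table(text = raw, header = TRUE, colClasses = "character")
AS_TEXT$StatusCode <- factor(AS_TEXT$StatusCode)
code_counts <- table(AS_TEXT$StatusCode)
print(code_counts)
```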

The summary(WWWDATA) command gives the following output:

     > summary(WWWDATA)
           Time          ServerBytes        ClientBytes       StatusCode
      10:46  :   3145   Min.   :       0   Min.   :   0.0   _304   :709255
      10:58  :   3081   1st Qu.:     140   1st Qu.: 401.0   _200   :435146
      10:55  :   3066   Median :     142   Median : 435.0   _302   :  7371
      10:37  :   3054   Mean   :    2460   Mean   : 438.1   _404   :  4641
      10:32  :   2959   3rd Qu.:     407   3rd Qu.: 470.0   _500   :  3983
      09:30  :   2814   Max.   :49083902   Max.   :2158.0   _206   :  2254
      (Other):1144676                                       (Other):   145
     >
    
Notice that the busiest minute was 10:46, when 3145 requests were served.
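If you need the busiest minute as a value rather than reading it off the summary, counting with table() is one option. The Time values below are invented:

```r
# Invented request times, one entry per request.
WWWDATA <- data.frame(
  Time = c("10:46", "10:58", "10:46", "09:30", "10:46", "10:58")
)

# Count the requests per minute and pick the minute with most requests.
per_minute <- table(WWWDATA$Time)
busiest <- names(which.max(per_minute))

print(sort(per_minute, decreasing = TRUE))
print(busiest)
```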

For further analysis, extract all the data for the 12:00 to 12:59 timeframe (grep '^12' WWWDATA.data) and load it into a data set named WWW12. Executing the pairs(WWW12) command produces the output shown in Figure 4.
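The same selection can also be made inside R once WWWDATA is loaded, which avoids creating a second data file. This is a sketch with invented rows, assuming the Time column holds HH:MM strings as in the summary output:

```r
# Invented rows standing in for the real WWWDATA data set.
WWWDATA <- data.frame(
  Time        = c("11:59", "12:00", "12:20", "12:59", "13:00"),
  ServerBytes = c(140, 142, 407, 141, 2460),
  stringsAsFactors = FALSE
)

# Keep the rows whose Time starts with "12", like grep '^12' does.
WWW12 <- WWWDATA[grepl("^12", WWWDATA$Time), ]
print(WWW12)
```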

Also, the summary(WWW12) command gives the following output:

     > summary(WWW12)
           Time        ServerBytes       ClientBytes       StatusCode
      12:20  : 2003   Min.   :      0   Min.   :   0.0   _304   :45986
      12:24  : 1848   1st Qu.:    141   1st Qu.: 403.0   _200   :28914
      12:55  : 1800   Median :    142   Median : 436.0   _302   :  570
      12:16  : 1789   Mean   :   2273   Mean   : 444.6   _404   :  292
      12:01  : 1744   3rd Qu.:    407   3rd Qu.: 480.0   _500   :  214
      12:19  : 1713   Max.   :2631733   Max.   :1230.0   _206   :  124
      (Other):65217                                      (Other):   14
     >
    
See Table 4.

The main benefit of using R for systems administration is that you get a different perspective on your data, which can be useful as well as informative.

Acknowledgments

I would like to thank Nikos Platis and Manolis Skopelitis for helping me write this article.

References

Venables, W.N. and B.D. Ripley. Modern Applied Statistics with S, 4th Ed. Springer-Verlag, 2002. -- http://www.stats.ox.ac.uk/pub/MASS4

R Project home page -- http://www.r-project.org

StatLib -- http://lib.stat.cmu.edu/

S-PLUS -- http://www.insightful.com/

R and DBMSs page -- http://developer.r-project.org/db/

Mihalis Tsoukalos lives in Greece with his wife, Eugenia, and works as a High School Teacher. He holds a B.Sc. in Mathematics and an M.Sc. in IT from University College London. Before teaching, he worked as a Unix systems administrator and an Oracle DBA. Mihalis can be reached at: tsoukalos@sch.gr.