Using the R System for Systems Administration
Mihalis Tsoukalos
This article is about R, which is an advanced statistical package
with many complex capabilities. However, don't be afraid of R if
you aren't very comfortable with mathematics and statistics. This
article will cover some simple, useful capabilities of the package
tailored for systems administrators.
R is a GNU project based on S, a statistics-specific
language and environment developed at AT&T Bell Labs. R
is an interpreted computer language. The R system distribution supports
many statistical procedures including linear and generalized linear
models, nonlinear regression models, time series analysis, classical
parametric and nonparametric tests, clustering, and smoothing. The
current version of R is 2.0.0, which was released on October 4,
2004. For more information, visit the R Project home page (http://www.r-project.org).
There is also a commercial implementation of S, called S-PLUS (http://www.insightful.com/),
which has more facilities and capabilities than R. The examples
presented in this article can also run in S-PLUS with little or
no modification.
Running the R System
R runs on Unix/Linux variants as well as on Windows. R can also
run on Mac OS X Panther. There are GUIs for R, but all you need
for the purposes of this article is the command-line version. The
examples in this article were written using R on Mac OS X Panther
and Debian Linux.
To run R, type R (assuming that the R binary is in your
PATH), which will show something like the following:
racoon:~/code/R $ R
R : Copyright 2004, The R Foundation for Statistical Computing
Version 1.9.0 (2004-04-12), ISBN 3-900051-00-3
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.
[Previously saved workspace restored]
>
To quit R, just type q() at the prompt.
Basic Commands of the R System
First, the commands for entering, naming, and selecting data are
presented. The following example creates a data set (actually a
vector) called SYSADMIN that contains the 0th to 6th powers
of the number 3. To view the data in an existing data set, just type
its name at the R prompt:
> SYSADMIN <- 3^(0:6)
> SYSADMIN
[1] 1 3 9 27 81 243 729
>
The notation 0:6 returns the sequence from 0 to 6, including both 0 and 6,
for a total of seven numbers. The names() command assigns a name to
each element of a vector so that elements can be accessed by name. In this
example, the numbers 0 to 6 are used as names:
> names(SYSADMIN) <- 0:6
> SYSADMIN
  0   1   2   3   4   5   6
  1   3   9  27  81 243 729
> class(SYSADMIN)
[1] "numeric"
>
If you want to remove the names you gave to the elements of the vector
with the names(SYSADMIN) command, you can use the following
command:
names(SYSADMIN) <- NULL
The following lines show the advantages of calling a vector by name:
> SYSADMIN[0]
named numeric(0)
> SYSADMIN["0"]
0
1
> SYSADMIN[1]
0
1
>
You can now retrieve the 0th power of 3 with SYSADMIN["0"] instead
of SYSADMIN[1] (the 0th power is the first element of the vector),
which is less descriptive. Note that the first command, SYSADMIN[0],
returns an empty vector because R numbers vector elements from 1,
so there is no 0th element.
To add data to an existing data set, assign a value to a new name,
as in the following example:
> SYSADMIN["7"] <- 3^7
> SYSADMIN
   0    1    2    3    4    5    6    7
   1    3    9   27   81  243  729 2187
>
The following command illustrates how to select specific ranges of
data from the SYSADMIN vector using indices:
> SYSADMIN[3:4]
2 3
9 27
>
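As a small sketch of these selection styles, a vector can be indexed by position, by name, or by a logical condition. The vector name powers is hypothetical, chosen so that the example does not overwrite SYSADMIN, and the logical and negative index forms go slightly beyond what the examples above show:

```r
# Hypothetical vector mirroring SYSADMIN: powers of 3 named by exponent
powers <- 3^(0:7)
names(powers) <- 0:7

by_position <- powers[3:4]           # third and fourth elements
by_name     <- powers[c("0", "7")]   # elements named "0" and "7"
by_value    <- powers[powers > 100]  # only the elements larger than 100
without_1st <- powers[-1]            # everything except the first element

print(by_position)
print(by_name)
print(by_value)
print(without_1st)
```

Selecting by name or by condition keeps working when elements are added or reordered, which is usually not true of fixed positions.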
To delete a data set, use the rm() command. For example, to delete the
SYSADMIN data set, type rm(SYSADMIN) at the R prompt. To list the
available objects, use the objects() command.
The summary() command (the output will be explained in
more detail later) gives useful information about the data set.
For this example, the data set is called CALAMARIS. Data is taken
from a Calamaris report file. Calamaris is a program for analyzing
proxy server log files (a Squid log file was used here). Table 1
shows the data and explains the meaning of each column. The table
was saved in a file called CALAMARIS.data and loaded into R with
the following command:
> CALAMARIS <- read.table(
+   "/Users/mtsouk/docs/article/R.SysAdmin/examples/CALAMARIS.data",
+   header=TRUE)
> summary(CALAMARIS)
   domain    number.of.requests   percent.of.total.requests     Total.Bytes
 *.ca   :1   Min.   :   3.0       Min.   : 0.09               Min.   :   7919
 *.com  :1   1st Qu.:  32.0       1st Qu.: 0.94               1st Qu.: 114303
 *.de   :1   Median : 127.0       Median : 3.75               Median : 450469
 *.edu  :1   Mean   : 376.8       Mean   :11.11               Mean   :1656217
 *.gr   :1   3rd Qu.: 187.0       3rd Qu.: 5.51               3rd Qu.:1249501
 *.net  :1   Max.   :1403.0       Max.   :41.37               Max.   :7649799
 (Other):3
>
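To experiment with read.table() and summary() without a report file at hand, a sketch like the following can be used. The rows are hypothetical and are not the values from Table 1. The text argument of read.table() reads from a string instead of a file. Since R 4.0.0, read.table() reads strings as plain character columns by default, so stringsAsFactors=TRUE is requested here to get per-value counts like those shown above:

```r
# Hypothetical proxy-report rows; these are NOT the values from Table 1
report_text <- "domain requests bytes
*.com 120 450000
*.net 35 98000
*.org 8 12000"

# header=TRUE takes the column names from the first line, as in the article;
# stringsAsFactors=TRUE makes the text column a factor so summary() counts it
report <- read.table(text = report_text, header = TRUE,
                     stringsAsFactors = TRUE)

str(report)             # shows the type chosen for each column
print(summary(report))  # counts for the factor, six values for the numbers
```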
There is also a very handy way of representing a data set graphically.
Figure 1 shows the output of the pairs() command for the
CALAMARIS data set. What you see in Figure 1 is a matrix of scatterplots
in which every column of the CALAMARIS data set is plotted against every
other column.
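On a server without a display, the same kind of plot can be written to a file. The following is a minimal sketch with hypothetical numbers (not the CALAMARIS data) and a hypothetical file name in the temporary directory:

```r
# Hypothetical numeric data standing in for a report like CALAMARIS
traffic <- data.frame(
  requests = c(5, 40, 120, 200, 900, 1500),
  percent  = c(0.2, 1.4, 4.3, 7.2, 32.6, 54.3),
  kbytes   = c(12, 150, 480, 1300, 5200, 8100)
)

plot_file <- file.path(tempdir(), "traffic-pairs.pdf")
pdf(plot_file)   # open a PDF graphics device instead of a window
pairs(traffic)   # one scatterplot for every pair of columns
dev.off()        # close the device so that the file is written

print(plot_file)
```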
R supports the following types of objects:
- Vectors (the most important objects in R)
- Matrices (arrays)
- Factors
- Lists
- Data frames
- Functions
For more information about those objects, refer to the documentation
that comes with your R installation.
Advanced Commands of the R System
The save() command is used for dumping an object to disk
in order to use it later:
> save(SYSADMIN, file = "/Users/mtsouk/SYSAMIN.r")
To restore an object saved this way, use the load() command:
> rm(SYSADMIN)
> SYSADMIN
Error: Object "SYSADMIN" not found
> load( file = "/Users/mtsouk/SYSAMIN.r" )
> SYSADMIN
   0    1    2    3    4    5    6    7
   1    3    9   27   81  243  729 2187
>
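The same round trip can be tried safely with a temporary file. This is only a sketch: the object name powers and the file location are hypothetical.

```r
powers <- 3^(0:7)                          # hypothetical object to save
dump_file <- tempfile(fileext = ".RData")  # hypothetical temporary location

save(powers, file = dump_file)  # write the object to disk
rm(powers)                      # remove it from the workspace

# load() restores the object under its original name and returns,
# invisibly, the names of the objects it restored
restored <- load(dump_file)
print(restored)
print(powers)
```

Because load() restores objects under their saved names, it silently overwrites any object with the same name in the current workspace.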
The edit() command opens the data set in an editor, ready
for editing. I think this is very practical. R can also
import data from various formats and database systems, including PostgreSQL
and database sources supporting the ODBC interface. R can also communicate
via BSD sockets. For more information, refer to:
http://developer.r-project.org/db
The merge() command can be very useful because it works similarly
to database joins, which means that related tables of data can be
combined into one table. The following is a complete example of merge():
> SERVER
     Name           OS  Version
1   Pluto      Solaris        8
2   Plato Linux_Debian   Stable
3  Racoon          AIX       5L
4     Pik Linux_Debian Unstable
5 Eugenia  Solaris_x86        9
> ADMIN
  Machine Admin_Name Admin_Surname
1   Pluto        Tom       Philips
2 Eugenia       Anna         Tomas
3   Plato        Jim  Papadopoulos
4  Racoon      Peter         McRay
5     Pik       John         Papas
> merge(SERVER, ADMIN, by.x="Name", by.y="Machine")
     Name           OS  Version Admin_Name Admin_Surname
1 Eugenia  Solaris_x86        9       Anna         Tomas
2     Pik Linux_Debian Unstable       John         Papas
3   Plato Linux_Debian   Stable        Jim  Papadopoulos
4   Pluto      Solaris        8        Tom       Philips
5  Racoon          AIX       5L      Peter         McRay
>
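The article does not show how SERVER and ADMIN were created. One possible way, shown here only as a sketch, is to build them with data.frame() from the values printed above and then run the same merge():

```r
# The two tables from the example, typed in by hand
SERVER <- data.frame(
  Name    = c("Pluto", "Plato", "Racoon", "Pik", "Eugenia"),
  OS      = c("Solaris", "Linux_Debian", "AIX", "Linux_Debian", "Solaris_x86"),
  Version = c("8", "Stable", "5L", "Unstable", "9")
)
ADMIN <- data.frame(
  Machine       = c("Pluto", "Eugenia", "Plato", "Racoon", "Pik"),
  Admin_Name    = c("Tom", "Anna", "Jim", "Peter", "John"),
  Admin_Surname = c("Philips", "Tomas", "Papadopoulos", "McRay", "Papas")
)

# Join rows where SERVER$Name equals ADMIN$Machine; the key column keeps
# the name from the first table and the result is sorted by that key
joined <- merge(SERVER, ADMIN, by.x = "Name", by.y = "Machine")
print(joined)
```

By default merge() keeps only the rows that have a match in both tables, much like an inner join in a database.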
A Mail Server Application
Log files from a Postfix mail server are used in this
simple application. The data of interest in the log files includes
the top-level DNS domain (.gr, .com, etc.) of the outgoing mail address,
the delivery delay (in seconds), and the time of day (in HH:MM:SS format).
The data was extracted with grep, sed, and awk.
(Perl or another scripting language could have been used
instead.) The first 10 lines of the data, including the column titles,
are shown in Table 2.
Extracting Information
What information can we get from the data using R? Summary info
(using the summary() command) can be extracted, which in
this particular case gives:
> summary(MAILDATA)
     Time       Domain        Delay
 11:07:12:  5   au :  3   Min.   :  1.00
 08:51:05:  3   com: 10   1st Qu.:  2.00
 13:23:47:  3   edu:  2   Median :  3.00
 06:12:53:  2   gr :117   Mean   : 11.38
 16:42:34:  2   org: 11   3rd Qu.:  6.00
 00:52:50:  1   uk :  2   Max.   :217.00
 (Other) :129
>
This tells us that most of our emails go to the .gr domain and that
the busiest moment (only relatively busy, because these log files come from
my home dial-up server) is 11:07:12. Instead of Time, you can use
Day, Week, Month, or even Year variables for getting mail information.
The 3rd Qu. value of 6 seconds is very close to the Median of 3 seconds,
which means that there are no major delays in sending outgoing messages,
at least for 75% of the items in the data set. The Mean of 11.38 seconds
is pulled up by a few slow deliveries, such as the 217-second Max. If you
want more precise information, you can divide the data set into smaller
data sets.
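Dividing the data set can be sketched as follows. The data frame mail is a hypothetical sample in the shape of MAILDATA, not the data from the article, and subset() and tapply() are just one of several ways to do this in R:

```r
# Hypothetical sample with the same kind of columns as MAILDATA
mail <- data.frame(
  Domain = c("gr", "gr", "com", "gr", "org", "com"),
  Delay  = c(2, 3, 40, 1, 6, 10)
)

# A smaller data set holding only the messages sent to .gr addresses
gr_only <- subset(mail, Domain == "gr")
print(summary(gr_only$Delay))

# Median delay for every domain, computed in one step
per_domain <- tapply(mail$Delay, mail$Domain, median)
print(per_domain)
```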
Output Explanation
The Time and Domain data are not numbers, so R counts the occurrences
of each distinct value (treating each value as a string) and prints
the most frequent ones. As far as Delay (which is numeric) is concerned,
R calculates and displays the following six values:
Min. -- This is the minimum value of the data set.
Median -- This is a value that divides the sorted data set into
two subsets (left and right subsets) with the same number of elements.
If the data set has an odd number of elements, then the Median
is part of the data set. On the other hand, if the data set has
an even number of elements, then the Median is the mean value
of the two center elements of the data set.
1st Qu. -- The 1st Quartile q1 is a value, not necessarily
belonging to the data set, with the property that (at most) 25%
of the data set values are smaller than q1 and (at most) 75% of
the data set values are bigger than q1. You can consider it as
the Median of the left half subset of the sorted data set. In
the case that the number of elements of the data set is such that
q1 does not belong to the data set, it is produced by linear interpolation
between the two values at the left (v) and the right (w) of its position
in the sorted data set. For example, when that position lies a quarter
of the way from v to w:
q1 = 0.75 * v + 0.25 * w
Mean -- This is the mean value of the data set (the total sum
divided by the number of the items in the data set).
3rd Qu. -- The 3rd Quartile q3 is a value, not necessarily
belonging to the data set, with the property that (at most) 75%
of the data set values are smaller than q3 and (at most) 25% of
the data set values are bigger than q3. You can consider it as
the Median of the right half subset of the sorted data set. In
the case that the number of elements of the data set is such that
q3 does not belong to the data set, it is produced by linear interpolation
between the two values at the left (v) and the right (w) of its position
in the sorted data set. For example, when that position lies three
quarters of the way from v to w:
q3 = 0.25 * v + 0.75 * w
Max. -- This is the maximum value in the data set.
Please note that many definitions for finding Quartiles exist. If you try
another statistical package, you may get different results.
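The quartile formulas above can be checked with a small worked example. The six values below are hypothetical. The sketch assumes the default method of quantile() in current versions of R (called type 7 in its documentation), under which a data set of six values happens to place q1 a quarter of the way between the 2nd and 3rd values and q3 three quarters of the way between the 4th and 5th. With other data set sizes the weights differ.

```r
x <- c(1, 2, 6, 10, 14, 20)   # hypothetical, already sorted, six values

q  <- quantile(x)             # default quartile definition
q1 <- unname(q["25%"])
q3 <- unname(q["75%"])

# The interpolation formulas from the text, applied by hand
manual_q1 <- 0.75 * x[2] + 0.25 * x[3]
manual_q3 <- 0.25 * x[4] + 0.75 * x[5]

print(c(q1 = q1, manual_q1 = manual_q1, q3 = q3, manual_q3 = manual_q3))

# Another of the available definitions gives different quartiles
print(quantile(x, probs = c(0.25, 0.75), type = 6))
```

The last line illustrates the note about competing definitions: the same data yields different quartiles when a different type is requested.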
The pairs() command (output shown in Figure 2)
shows a graphical overview of the data. From this image, and
especially from the Time-Delay pair, you can conclude that there
are no major delays. You can also automate this
procedure and have the information sent to you by email.
By using the attach() command with a data set as an
argument, you can use the columns of the data set as individual
data sets. Thus, after running attach(MAILDATA), you can try the
hist(Delay) command to draw a histogram of the frequencies of the
delays and get a more accurate view of the delay times. Executing
hist(Delay, xlab="Delay in seconds", ylab="Number of emails", labels=TRUE)
produces the plot shown in Figure 3.
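For unattended use, for example from a scheduled job, the histogram can be written to a file instead of the screen. The following is a sketch with hypothetical delay values and a hypothetical file name. It passes the vector directly to hist() rather than using attach():

```r
# Hypothetical delays in seconds; not the MAILDATA values
delays <- c(1, 2, 2, 3, 3, 3, 4, 6, 6, 25, 217)

plot_file <- file.path(tempdir(), "delay-hist.pdf")
pdf(plot_file)   # write the plot to a PDF file instead of a window
h <- hist(delays, xlab = "Delay in seconds",
          ylab = "Number of emails", labels = TRUE)
dev.off()        # close the device so that the file is written

# hist() also returns the bin boundaries and the counts it plotted
print(h$breaks)
print(h$counts)
```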
A Web Server Application
For this example application, data is taken from a Web server
log file that covers one day. Again,
the data was extracted using a combination of the sed, awk,
and grep utilities.
The first 10 lines of the data, including the column titles,
are shown in Table 3.
Note that the underscore in front of the status code was added
so that the StatusCode value will not be considered a numeric
value by R.
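Prefixing the codes with an underscore is the approach used in this article. An alternative, shown here only as a sketch with hypothetical status codes, is to leave the data file untouched and convert the column to a factor after loading it:

```r
codes <- c(200, 304, 304, 404, 200, 304)  # hypothetical status codes

# Treated as numbers, the codes get a numeric summary, which says little
numeric_summary <- summary(codes)
print(numeric_summary)

# Converted to a factor, each distinct code is counted instead
status <- factor(codes)
status_counts <- summary(status)
print(status_counts)
```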
The summary(WWWDATA) command gives the following output:
> summary(WWWDATA)
      Time           ServerBytes       ClientBytes       StatusCode
 10:46  :   3145   Min.   :       0   Min.   :   0.0   _304   :709255
 10:58  :   3081   1st Qu.:     140   1st Qu.: 401.0   _200   :435146
 10:55  :   3066   Median :     142   Median : 435.0   _302   :  7371
 10:37  :   3054   Mean   :    2460   Mean   : 438.1   _404   :  4641
 10:32  :   2959   3rd Qu.:     407   3rd Qu.: 470.0   _500   :  3983
 09:30  :   2814   Max.   :49083902   Max.   :2158.0   _206   :  2254
 (Other):1144676                                       (Other):   145
>
Notice that the busiest minute was 10:46, when 3145 requests were
served.
For further analysis, extract all the data for the 12:00 to 12:59
timeframe (grep '^12' WWWDATA.data) and load it into a data set
named WWW12. Then execute the pairs(WWW12) command. The output
is shown in Figure 4.
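The article filters the file with grep before loading it. The same selection can also be made inside R on a data set that is already loaded. This is only a sketch, using a hypothetical data frame www in the shape of WWWDATA:

```r
# Hypothetical rows in the shape of WWWDATA; not the article's data
www <- data.frame(
  Time        = c("11:59", "12:00", "12:20", "12:59", "13:00"),
  ServerBytes = c(140, 142, 407, 141, 2460),
  stringsAsFactors = FALSE
)

# Keep only the rows whose Time starts with "12", like grep '^12'
www12 <- www[grepl("^12", www$Time), ]
print(www12)
```

Unlike the grep command, this keeps the column titles, because the header line is not part of the data being filtered.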
Also, the summary(WWW12) command gives the following
output:
> summary(WWW12)
     Time          ServerBytes      ClientBytes      StatusCode
 12:20  : 2003   Min.   :      0   Min.   :   0.0   _304   :45986
 12:24  : 1848   1st Qu.:    141   1st Qu.: 403.0   _200   :28914
 12:55  : 1800   Median :    142   Median : 436.0   _302   :  570
 12:16  : 1789   Mean   :   2273   Mean   : 444.6   _404   :  292
 12:01  : 1744   3rd Qu.:    407   3rd Qu.: 480.0   _500   :  214
 12:19  : 1713   Max.   :2631733   Max.   :1230.0   _206   :  124
 (Other):65217                                      (Other):   14
>
See Table 4.
The main benefit of using R for systems administration is
that you get a different perspective of your data, which can
be useful as well as informative.
Acknowledgments
I would like to thank Nikos Platis and Manolis Skopelitis for
helping me write this article.
References
Venables, W.N. and B.D. Ripley. Modern Applied Statistics
with S, 4th Ed. Springer-Verlag, 2002. -- http://www.stats.ox.ac.uk/pub/MASS4
R Project home page -- http://www.r-project.org
StatLib -- http://lib.stat.cmu.edu/
S-PLUS -- http://www.insightful.com/
R and DBMSs page -- http://developer.r-project.org/db/
Mihalis Tsoukalos lives in Greece with his wife, Eugenia,
and works as a High School Teacher. He holds a B.Sc. in Mathematics
and a M.Sc. in IT from University College London. Before teaching,
he worked as a Unix systems administrator and an Oracle DBA.
Mihalis can be reached at: tsoukalos@sch.gr.