How to Crawl Back Inside Your Shell
Shawn Bayern with Brian Carp
Searching for the perfect programming language is appealing; thinking you've found one is even better. When one tool looks like it can do everything, why look for anything else? Environments like Perl, for example, are extremely alluring -- and extremely useful -- because of their flexibility. But, at least in the past, UNIX systems administrators have rarely embraced monolithic solutions to problems. Programmers know why custom programming is often considered a last resort: programs are time-consuming to write, difficult to debug, and even harder to maintain. When at all possible, sysadmins want to take advantage of existing tools to solve problems.
Almost since its birth, UNIX has depended on a model of stringing small tools together to solve complex problems quickly and in a straightforward manner. The danger of large, flexible scripting languages like Perl is that they make programming look so easy that we are sometimes fooled into ignoring existing tools. "Sure, I can write that in Perl" is now a common first response to many problems; because of this, we might forget that a tool we need is already written, or only a few steps away from being written. We also might neglect, when writing our own new tools, to fit them into an existing and powerful framework.
Solving Complex Problems with Small Tools
To track down a mail loop recently, I needed to determine how many messages were in a mailbox, and from which addresses they were sent. Ideally, I needed to know information along the lines of, "This mailbox contains 458 messages from bob@harvard.edu, 62 messages from root@localhost, etc." Writing a Perl script to handle this problem would have been moderately straightforward: break the mailbox apart into messages, split the appropriate lines, parse addresses, and count them. Then I would need to think about the problem more carefully: if I want the script to be accurate, how precisely should I break the mailbox into individual messages? Does the "From" line that constitutes a postmark need to be preceded by two newlines, or is one enough? I would have needed to read RFCs or look at existing code to make sure my solution was right.
Instead, I decided to use existing tools. formail is a standard tool on almost all platforms -- it's there to process mailboxes quickly. Using formail, I could easily split apart the mailbox into individual messages, and from there, the solution fell into place:
formail -s head -1 | tr '[A-Z]' '[a-z]' |
awk '{print $2}' | sort | uniq -c | sort -nr
To solve the problem, I did not need to write a program to read individual lines and cycle through them. All the logic in my proposed solution is at the top level. My small shell script can be read as, "Split the mailbox apart into messages, get the first line of each message, convert it to lowercase, extract the second field, sort the lines, figure out how many adjacent lines are the same, and sort the result numerically." An equivalent line-oriented script might follow a procedure more like, "Read in a line; if it starts with 'From ', then set the following variable to its second field..."
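For comparison, a rough sketch of that line-oriented approach might look like the following (written here as an awk one-liner rather than a full Perl script; "mailbox" is just a placeholder filename). Note that it already glosses over the postmark subtleties mentioned above, since any body line that happens to begin with "From " will be counted too:

awk '/^From / { count[tolower($2)]++ }
     END { for (addr in count) print count[addr], addr }' mailbox | sort -nr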
In other words, the interface between all the pieces in a pipeline is well defined: the programs simply read text and write text. Dealing with the problem at this level of abstraction is simpler and less fallible than writing functionally equivalent for() or while() loops to discard lines, parse fields, and so on. So, even if you found a Perl module that did some of the work for you, you'd still have to string together the pieces the way a programmer would -- with subroutines and variables whose relationships your code has to define and manage. With a pipeline, you get to make more general assumptions that simplify the solution.
By the way, sort | uniq -c | sort -nr happens to be a useful idiom for counting and ranking almost anything. To determine how many users use each shell on your system, for example, you could run:
awk -F: '{print $7}' < /etc/passwd | sort | uniq -c | sort -nr
which lists each shell and how many users it has, the most popular first. Periodically running pipelines like this is a great way to find misspellings in key system configuration files. When testing this pipeline, I found one unfortunate soul whose shell was set to /bin/csh' -- trailing quote and all.
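For example, a crontab entry along the following lines would mail you the report once a week (the schedule, the mail -s flag, and the recipient here are all assumptions about your local setup):

0 4 * * 1 awk -F: '{print $7}' < /etc/passwd | sort | uniq -c | sort -nr | mail -s 'shell usage report' root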
Like the pipe, backquotes are a traditional syntactic feature that can simplify your solutions. Backquotes let you run a pipeline, retrieve the output, and interpolate the results in the middle of another command. They can be especially useful when used in conjunction with commands that perform a procedure over all their arguments, like cat, cp, and most other file-oriented commands. Just as in the message-counting script, you can use backquotes to deal with a group of things as a whole, rather than iterating over them manually.
For instance, I wanted to develop a technique to move all files referenced in my /etc/aliases file to a single directory. /etc/aliases has lines of the form:
peon-list: :include:/home/p1/peon/peon-list
I wanted to extract /home/p1/peon/peon-list, and all similar filenames, from the master aliases file and copy the files from their original locations to a centralized directory (in preparation for a migration away from a particular server). So, I ran:
cp `cat /etc/aliases | \
egrep ':include:' | \
awk '{print $2}' | \
sed 's/^:include://'` /var/lists
as a small shell script, instead of writing a script that parsed /etc/aliases line-by-line and executed cp for each file. As before, this let me approach the problem at a high and convenient level of abstraction. If cp were instead scp or rcp (commands that involve a much higher startup cost than cp because they set up a network connection), then this technique might even be faster at runtime than a file-by-file loop. Of course, command lines can't be arbitrarily large, so you'll need to use xargs if your backquotes would produce too much output, but for relatively small tasks, backquotes can be a convenient way to leverage shell syntax to form complex tools from simpler ones.
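For instance, one way to restate the same job with xargs (a sketch, under the same assumptions about the format of /etc/aliases; with -I, xargs runs one cp per file, trading some efficiency for the ability to handle arbitrarily long lists) would be:

egrep ':include:' /etc/aliases | awk '{print $2}' | \
sed 's/^:include://' | xargs -I {} cp {} /var/lists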
Mastering shell scripting and constructing useful pipelines at the command line involves more than simply understanding shell syntax. You also need to become familiar with the programs that come with almost all varieties of UNIX and their uses. For instance, if you ever chop off the bottom of a file by using something other than head, or if you count lines with something other than wc, you're probably wasting effort. If you parse out a field in a line with a Perl regular expression, you may be wasting effort if cut or awk can handle it in a more straightforward manner.
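A few examples of what those tools cover (the filenames and field numbers here are arbitrary):

head -20 /var/log/messages                 # keep the first 20 lines, chopping off the rest
wc -l /etc/passwd                          # count lines
cut -d: -f1,7 /etc/passwd                  # pull out the login and shell fields
awk -F: '$3 == 0 {print $1}' /etc/passwd   # the same kind of extraction, with a condition attached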
Understanding the way that existing tools work will help you design your own tools that interface well with existing ones. For instance, my script above, which counts a mailbox's messages, reads from the standard input and writes to the standard output. It can therefore be used in the middle of another pipeline -- say, one that adds up the numbers on each line and prints a total number of messages.
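For example, appending one more awk command (a sketch, again assuming the mailbox arrives on the standard input) prints the total along with the per-address counts:

formail -s head -1 | tr '[A-Z]' '[a-z]' | awk '{print $2}' | \
sort | uniq -c | sort -nr | awk '{print; total += $1} END {print total, "total"}'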
Integrating Scripting and Programming
So far, I've demonstrated some examples of what I've called "scripts" that are simpler than their functionally equivalent "programs." Distinguishing between programming and scripting is generally a pointless endeavor -- there's clearly a continuum between the two extremes, not a sharp dividing line. Conceptually, though, shell scripts (such as small files interpreted and executed by the Bourne shell, C shell, or their descendants) depend on running existing programs and leveraging simple syntax to connect small tools in a variety of ways. It is specifically this sort of "shell scripting" that takes advantage of the traditional UNIX model.
However, traditional shell scripts aren't the only things that can take advantage of the UNIX toolset. Most large scripting environments have an interface to the shell that lets you send commands out and get back a response.
Perl is a particularly excellent "glue language" that interfaces easily with almost everything. Even from inside a Perl script, you can take advantage of traditional UNIX tools, pipelines, and syntax if you decide that existing programs can solve your problem more easily than custom lines of code. And even from within a C program, you can use the system() and popen() functions to execute and communicate with other programs (though this is, of course, simpler to do in Perl).
For example, from within a Perl script, I once needed to figure out which days in a particular month were Saturdays. I had the month and year as Perl variables and had many options available for solving this problem. Date programming, in particular, is always hard to get perfect, and I realized I didn't have to bother because most of the work was already done for me.
I decided to use the cal program, which yields a text-based calendar for any month of any year (from 1 to 9999, at least). Perl supports backquotes in the same way traditional shells do, so I constructed a solution like this:
@saturdays = split " ",
`cal $month $year | sed '1,2d' | awk '{print \$NF}'`;
That is, "Get the calendar, delete the first two lines, then print the final field in each line." Much easier than having to worry about Julian and Gregorian dates, leap years, and such, huh? And it's even easier than using a Perl module, looping over the days in a month, converting them to weekdays, and storing Saturdays as you go along.
The Role of Scripting
Some administrators have been taught not to write shell scripts or construct long pipelines because of performance concerns. To respond to such concerns, first evaluate whether or not slight performance differences will matter: the old aphorism that a programmer's time is more important than CPU time is usually true, especially if you're writing a batch job that's going to run at night or a one-time hack to solve a small problem. If you absolutely can't justify anything short of a fast-running C program or Perl script, consider that there are at least a few cases where a shell script (or commands entered at the shell) might paradoxically be faster than a functionally equivalent C program. For instance, a shell script that executes a pipeline repeatedly may actually run faster than a C program that executes the same pipeline via a call to system(), because the C program needs to start an (intermediate) shell to run the pipeline. So, unless you want to implement the shell's functionality (via pipe(), fork(), and exec()) in your C program, your shell script may, surprisingly, beat out its lower-level counterpart.
Furthermore, while the UNIX toolkit is virtually omnipresent, Perl and even user-accessible C compilers aren't. If you're on a system where you can't compile a C program or install a Perl interpreter, understanding shell syntax and traditional UNIX tools -- sed, awk, grep, and so on -- will be helpful.
Of course, Perl and other modern scripting languages are indispensable. When they are compared to writing new C programs to solve each new problem that arises, their advantages are obvious. That doesn't make them a catch-all solution, however, or the first and best approach to every situation.
Tasks are often not as complex as they appear, especially because a good chunk of the work may already be written for you. Determining the incremental complexity of a task -- that is, what still needs to be done -- is the first step toward finding its simplest solution. To save time, understand what tools are available, how to string them together syntactically, and how to write simple shell scripts. Above all, when you're trying to solve problems as quickly as possible, think of programming (even simple programming, and perhaps even Perl programming) the way it deserves to be thought of -- as a last resort, a powerful yet difficult tool to wield, and one that should generally be avoided.
About the Author
Shawn Bayern is a systems programmer in Yale University's ITS Technology and Planning group. As a student and, more recently, a staff member at Yale, he has administered (and written scripts for) systems that serve over 20,000 users. Brian Carp is a student at Yale University. Shawn Bayern can be reached at: shawn.bayern@yale.edu.