DTrace -- Most Exposing Solaris Tool Ever

Peter Baer Galvin

DTrace is a powerful new tool that's part of the Solaris 10 release and is available in pre-release via the Software Express for Solaris mechanism discussed in the April 2004 Solaris Companion. Because it is unique, DTrace is a bit difficult to describe. In this column, I'll summarize the features of DTrace, but I'll leave it to the Solaris kernel engineers who wrote it to explore it with me in a series of questions and answers. I think that by the time you've heard the engineers talk about DTrace, and once you experience it yourself, you'll agree with me that it's a brilliant piece of work that greatly expands our ability to understand the workings of Solaris.

Q&A with the DTrace Team

The following questions and answers were excerpted from an email discussion with Bryan Cantrill, Michael Shapiro, and Adam Leventhal, of Sun's Solaris Kernel Development team:

    Q: How large was the team that worked on DTrace?

    A: There were three engineers on the team -- we're a big believer in the power of small, highly motivated teams.

    Q: What was the motivation for DTrace?

    A: Bluntly, when the system misbehaved, we had only ad hoc techniques for figuring out what the hell was going on. One incident is particularly illustrative: in December 1997, we were called in on a benchmarking crisis on what was then Sun's biggest machine -- a 64-processor E10K.

    When running the benchmark, the machine would periodically enter a state where performance would drop off exponentially. The machine would stay in this state for up to several minutes, after which time it seemed to magically recover. Those running the benchmark had determined that eliminating the "bad state" would allow us to blow away the world record on this particular benchmark -- so there was a tremendous amount of pressure within the company to get this problem resolved.

    We had the company's top experts on the problem, and we were working around the clock, using every ad hoc black-art technique at our disposal -- much of which consisted of adding custom instrumentation to kernel modules to home in on the problem. Adding this instrumentation required delicacy and precision -- and it required rebooting the system every time we wanted to add more instrumentation. This system took almost two hours to reboot(!), so this alone was a serious obstacle to moving quickly on the problem. To wit: even with the top experts at the company working on the problem, it took us nearly a week to fully understand it. And here's the kicker: the problem turned out to be a simple misconfiguration.

    The system had been erroneously configured to be an IPv4 router (i.e., neither /etc/notrouter nor /etc/defaultrouter existed), and it would occasionally start to furiously route packets between two other machines related to the benchmark. This (unnecessary) routing activity became so substantial that it induced contention in the IP layer of the networking stack, which in turn caused the benchmark activity to drop to practically nil. Once /etc/defaultrouter was properly set, the problem went away -- and the machine went on to smash the world record on that particular benchmark.

    While it was a relief to have found the problem, we emerged from the experience resolved to find a better way to diagnose our systems. By that time, we had already begun thinking about dynamic instrumentation, but after this experience we became intent on seeing our ideas become reality. In designing and implementing DTrace, we have always carried this particular experience with us: we wanted to build the tool that would have allowed us to immediately identify the root cause of this kind of problem.

    Q: How long has DTrace been under development?

    A: It took us just shy of two years to implement DTrace as it exists today in Solaris Express -- but we're still busily adding enhancements to DTrace, so by the time Solaris 10 is generally available, it will reflect that much more time of active DTrace development.

    Q: As far as I know, there has never been a tool like DTrace in any operating system. Did it have predecessors? If not, what were your influences and inspirations in creating it?

    A: There are no real predecessors per se -- which is not to say that all of the ideas are new; for example, Barton Miller's work on dynamic instrumentation at the University of Wisconsin was particularly inspirational. While we were somewhat influenced by earlier work, DTrace nevertheless represents a significant divergence from it. While most of the previous work in this domain has been aimed at understanding an application in development, DTrace is aimed explicitly at understanding a system in production.

    Q: Although DTrace is only available as part of Solaris 10, you've already used it at client sites to solve production problems. How?

    A: We have indeed used DTrace to solve performance problems (some of them very serious!) on customer machines. But it's a bit tough to walk you through it, because one typically uses DTrace many times before one understands the root cause. Part of the power of DTrace is that it allows quick iterative analysis: the output from one DTrace enabling often raises a new question that leads to a new DTrace enabling. It's very rare that one can walk up to a slow app, run DTrace once, and have the answer -- so it's difficult to walk you through a complete example without showing how the output of one script prompted us to write the next.
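
    For illustration only (this sequence is not one of the customer engagements, and the application name below is made up), a first enabling might simply count system calls by application:

    # dtrace -n 'syscall:::entry { @[execname] = count(); }'

    If that output points a finger at one application, a second, narrower enabling can ask which system calls it is making:

    # dtrace -n 'syscall:::entry /execname == "suspect"/ { @[probefunc] = count(); }'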

    That said, here's a relatively simple example that came up recently: in a hallway conversation here in Menlo Park, an engineer tasked with improving process start time wondered how many minor faults a process typically takes between fork()'ing and exec()'ing. The answer could guide improvements to the operating system: if the number of pages faulted in between a fork() and an exec() were very small, it might be worth considering a scheme that makes individual page faults somewhat more expensive in exchange for reducing the cost associated with a fork() or exec(). Historically, we would have determined this primarily via thought experiment -- and if we had wanted a better answer, we would have run truss on a few random processes (using its -m fltpage option) and crossed our fingers that the data was representative. And if we had wanted to know the exact answer, someone would have had to generate a kernel with custom instrumentation -- an incredibly time-consuming task requiring specialized expertise.

    But with DTrace, we wrote the following script:

    proc:::start
    {
            /* a new process has started running after fork() */
            self->pf = 0;
    }

    vminfo:::as_fault
    {
            /* count each address-space fault taken by this thread */
            self->pf++;
    }

    proc:::exec
    {
            /* at exec(), record how many faults the process took since starting */
            @[execname] = lquantize(self->pf, 0, 100);
            self->pf = 0;
    }
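
    A script like this can be saved to a file (the name below is arbitrary) and run until interrupted; dtrace prints the aggregation when it exits:

    # dtrace -s forkexec.d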
    
    We ran this on our local production machines (NFS servers, build servers, and the like). The output was a histogram of the number of page faults between fork() and exec() for each application name. The answer was somewhat surprising: most apps take ~30-45 page faults between a fork() and an exec(). Here, for example, was the histogram for ksh:

    ksh                                               
             value  ------------- Distribution ------------- count
                28 |                                         0    
                29 |@@@                                      2    
                30 |                                         0    
                31 |@@@@@@@@@@@@@                            8    
                32 |@@@                                      2    
                33 |@@@@@@                                   4    
                34 |@@@                                      2    
                35 |@@@@@@@@                                 5    
                36 |                                         0    
    
    This was higher than the engineer had anticipated, and the answer immediately redirected his thinking -- saving Sun the cost of even prototyping a solution to the wrong problem! The time from question to answer was about five minutes, and the data was gathered from production machines!

    Here's another example that came up recently: as many people probably know, some programs have reason at one time or another to close all of their file descriptors. Many programs do this by simply iterating over all file descriptors and closing them -- regardless of whether they're open or not. Especially if RLIMIT_NOFILE is set high, this can be painful. To allow applications to do this without quite so much pain, we added closefrom(3C) to Solaris quite some time ago. It recently occurred to someone to find applications that were candidates for closefrom(3C), so we wrote the following D script to find closes of long runs of file descriptors that return EBADF:

    #!/usr/sbin/dtrace -s

    #pragma D option quiet

    /* remember the file descriptor passed to close(2) */
    syscall::close:entry
    {
            self->arg = arg0;
    }

    /*
     * This close() failed with EBADF and the descriptor is adjacent to the
     * previous one (or starts a new run): extend the run.
     */
    syscall::close:return
    /self->arg && errno == EBADF &&
        (self->arg == self->last + 1 ||
        self->arg == self->last - 1 || self->last == 0)/
    {
            self->run++;
            self->last = self->arg;
            self->arg = 0;
    }

    /* this close() did not extend a run of EBADF closes */
    syscall::close:return
    /self->arg/
    {
            self->arg = 0;
            self->last = 0;
    }

    /* ignore runs of fewer than 10 descriptors */
    syscall:::entry
    /!self->arg && self->run && self->run < 10/
    {
            self->run = 0;
            self->last = 0;
    }

    /* a run of 10 or more has ended; record its length for this application */
    syscall:::entry
    /!self->arg && self->run/
    {
            @[execname] = lquantize(self->run, 0, 300, 10);
            self->run = 0;
            self->last = 0;
    }

    /* print the aggregation once per second */
    profile:::tick-1sec
    {
            printa(@);
    }
    
    Running this on our local production mail server (1000+ users) yielded the following output:

      mutt                                              
               value  ------------- Distribution ------------- count  
                 230 |                                         0   
                 240 |@@@@@@@@@@@@@@@@@@@@                     2   
                 250 |@@@@@@@@@@@@@@@@@@@@                     2   
                 260 |                                         0   
    
      slocal                                            
               value  ------------- Distribution ------------- count    
                 240 |                                         0   
                 250 |@@@@@@@@@@@@@@@@@@@@                     11  
                 260 |                                         0   
    
      sendmail                              
               value  ------------- Distribution ------------- count    
                 230 |                                         0   
                 240 |@@@@@@@@                                 6   
                 250 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          22  
                 260 |                                         0   
    
    Sendmail has since been changed to use closefrom(3C); the others are under investigation. (And running the script on a build server revealed that "javac" is also a candidate application!)

    This is a good example of simple information that was essentially impossible to gather (or certainly impossible to gather easily) on production machines prior to DTrace.

    Q: DTrace comes with its own new language, "D". Was a new language necessary?

    A: We don't view "D" as a totally new language; instead it's the companion language to C for dynamic tracing. Over 90% of D's syntax is simply ANSI C. To this familiar and stable base we added a small set of enhancements, which generally divide into two major categories: a program structure suitable for a tracing request rather than a traditional linear program structure, and a set of primitives for more advanced data types such as associative arrays and aggregations so that D programmers don't have to implement data structures.

    A tracing program is unlike a traditional program in that rather than proceeding in some order defined by the programmer, the program acts as a sort of symbiotic entity that executes based on the activity of one or more other programs that are being traced. For this reason, the D program structure uses a set of clauses that match probes rather than traditional functions that call one another.

    Many tracing programs are also very short-lived, perhaps even typed on the command line, as the user quickly asks a series of questions specific to a particular problem being investigated. For this reason, we wanted to give our users easy access to sophisticated data structures such as associative arrays and first-class manipulation of strings. Traditional programming languages would have added a lot of unnecessary overhead to these tasks. So we took as our base the language of UNIX, C, and created a companion optimized for use in tracing activities.
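
    As a minimal sketch of what this looks like in practice (this fragment is illustrative rather than drawn from a production engagement), the following D program uses two clauses, a predicate, a thread-local variable, and an aggregation keyed on a string, with no data structures declared anywhere:

    syscall::write:entry
    {
            /* thread-local variable; no declaration needed */
            self->ts = timestamp;
    }

    syscall::write:return
    /self->ts/
    {
            /* aggregate write(2) latency, in nanoseconds, by application name */
            @[execname] = quantize(timestamp - self->ts);
            self->ts = 0;
    }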

    Q: How can a tool have a userland compiler, create kernel triggers, take no resources when not used, and still be secure and perform well?

    A: By being designed very carefully to meet those requirements :). Seriously, though, this question highlights some of the key problems we specifically set out to solve with DTrace and that we had to overcome in order to make DTrace effective for use on production systems.

    The DTrace compiler operates in userland because it simply converts requests from the high-level D programming language into byte code for a very small, simple virtual machine designed specifically for DTrace. As instrumentation requests are loaded into the kernel, DTrace performs a set of static safety checks before enabling the instrumentation, and further dynamic safety checks as virtual machine instructions are emulated. Because these safety checks are implemented in the kernel against such a simple byte code, an errant program, broken compiler, or malicious user cannot damage the system using DTrace.

    Our forthcoming USENIX paper describes the safety issues in further detail. In addition to safety considerations, DTrace makes use of the new Solaris 10 Least Privilege technology to offer three separate privileges that control how much information a user can access with DTrace. These privilege bits permit developers to examine only their own processes, while systems administrators can perform system-wide analysis. Together, DTrace's safety and security mechanisms mean that DTrace is safe for use in production.
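
    For example (a sketch only; the user name is hypothetical, and the privilege details are covered in the Solaris DTrace and privileges(5) documentation), an administrator might grant a developer the process-scoped DTrace privileges with something like:

    # usermod -K defaultpriv=basic,dtrace_proc,dtrace_user devuser

    The dtrace_kernel privilege, which exposes kernel state system-wide, would remain reserved for administrators.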

    DTrace consumes no resources when not in use because we made all of its instrumentation dynamic: a system where DTrace is not being used is effectively the same as a system where DTrace is not even installed. Because all of the instrumentation is enabled dynamically and discretely, and the D language allows you to express your questions very concisely and filter data at the source, the overhead of applying DTrace scales with your use and usually provides very low probe effect. These benefits could not be achieved by previous static tracing systems where a single global switch enabled tracing.
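
    As a small illustration of filtering data at the source (the target process ID here is hypothetical), a predicate can restrict an enabling to a single process, so that only the data of interest is ever recorded:

    # dtrace -n 'syscall::read:entry /pid == $target/ { @["bytes requested"] = quantize(arg2); }' -p 1234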

    Q: What kinds of problems can DTrace help solve?

    A: Just about anything that can be phrased as a question -- DTrace allows you to answer pretty much any question you may have about your system. For example, you might wish to know "what application is causing all of the cross calls?" Answering this question -- which has been historically very difficult to answer -- is as easy as dtrace -n xcalls'{@[execname] = count()}'. And that's just the beginning: often, one finds that the answer to one question raises another question. With DTrace, it's very easy to quickly iterate from one answer to the next question.
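
    To continue that hypothetical iteration (the application name below is made up), a next enabling might attribute the cross calls to user stack traces within the offending application:

    # dtrace -n 'sysinfo:::xcalls /execname == "suspect"/ { @[ustack()] = count(); }'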

    Q: What kinds of things does DTrace not solve? For example, getting knowledge of the performance of network interfaces, storage interfaces, and system busses has been, ahem, a "challenge" in previous Solaris releases. Does DTrace help there?

    A: Absolutely. Take network interfaces, for example. As you alluded to, getting network performance numbers has traditionally been very difficult with Solaris. With DTrace, getting the answer to a question like "what applications are sending what network traffic?" is easy:

    # dtrace -n tcp_wput:entry'{@[execname] = quantize(msgdsize(args[1]))}'
    
    For example, here is some output from running the above on a mail and NFS server:

      save
               value  ------------- Distribution ------------- count
                  -1 |                                         0
                   0 |                                         5
                   1 |                                         0
                   2 |                                         0
                   4 |                                         0
                   8 |                                         0
                  16 |                                         0
                  32 |                                         1
                  64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  54958
                 128 |                                         71
                 256 |                                         30
                 512 |                                         10
                1024 |                                         8
                2048 |                                         23
                4096 |                                         98
                8192 |                                         189
               16384 |                                         5
               32768 |                                         0

      imapd
               value  ------------- Distribution ------------- count
                  -1 |                                         0
                   0 |                                         1269
                   1 |                                         0
                   2 |@                                        2483
                   4 |                                         18
                   8 |@                                        2149
                  16 |@@@@                                     5892
                  32 |@@@@@                                    7927
                  64 |                                         764
                 128 |                                         565
                 256 |@@@@@@@@@@                               13508
                 512 |@@@@                                     5999
                1024 |@@@@@@@@@                                12201
                2048 |                                         47
                4096 |                                         28
                8192 |                                         395
               16384 |                                         0

      nfsd
               value  ------------- Distribution ------------- count
                  -1 |                                         0
                   0 |                                         1765
                   1 |                                         0
                   2 |                                         0
                   4 |                                         0
                   8 |                                         135
                  16 |                                         181
                  32 |@@@@@@@@@@@                              182137
                  64 |@@@@@@@@@@@@@@@@@@@@@                    323422
                 128 |@@@@@                                    84998
                 256 |                                         9420
                 512 |                                         723
                1024 |                                         2170
                2048 |                                         191
                4096 |                                         245
                8192 |                                         1116
               16384 |                                         617
               32768 |                                         1662
               65536 |                                         0
    
    This is roughly as we would expect, but one surprise is that the backup software ("save") seems to be sending small packets. Why is it doing that? Is that a failure mode of some sort, or is that normal operation? And is there an opportunity for optimization here? It's a good example of how one DTrace enabling often raises questions just as it answers them, and it also raises another important point: by itself, DTrace does not solve performance problems. Rather, DTrace is a tool by which these problems can be understood. With DTrace, performance analysis becomes much less labor intensive, but much more brain intensive -- and a hell of a lot more fun!

Closing Comments

DTrace is certainly an impressive and powerful tool. I've been experimenting quite a bit with it, and it appears to meet the analysis and debugging needs of kernel developers, application developers, and systems administrators alike. The manual, available as part of Software Express for Solaris, is daunting at first glance, but it is very clearly written in a tutorial-plus-reference style. If you know C or another reasonable programming language, it won't take much time to get up to speed and start reaping the benefits of this greatly expanded visibility into the workings of Solaris and the processes running on it.

I thank Sun's Solaris Kernel Development team for taking the time to provide this information about DTrace. For more information, see the manuals included with Software Express for Solaris. There is also a discussion board about DTrace at:

http://www.sun.com/bigadmin/content/dtrace/

Also, the kernel engineers are giving a talk about DTrace at the upcoming USENIX '04 conference in Boston (http://www.usenix.com/events/usenix04/). I'll also be there, teaching "Advanced Topics in Solaris Administration". I hope to see you there.

Peter Baer Galvin (http://www.petergalvin.info) is the Chief Technologist for Corporate Technologies (www.cptech.com), a premier systems integrator and VAR. Before that, Peter was the systems manager for Brown University's Computer Science Department. He has written articles for Byte and other magazines, and previously wrote Pete's Wicked World, the security column, and Pete's Super Systems, the systems management column, for Unix Insider (http://www.unixinsider.com). Peter is coauthor of the Operating System Concepts and Applied Operating System Concepts textbooks. As a consultant and trainer, Peter has taught tutorials and given talks on security and systems administration worldwide.