DTrace
-- Most Exposing Solaris Tool Ever
Peter Baer Galvin
DTrace is a powerful new tool that's part of the Solaris 10 release
and is available in pre-release via the Software Express for Solaris
mechanism discussed in the April 2004 Solaris Companion. Because
it is unique, DTrace is a bit difficult to describe. In this column,
I'll summarize the features of DTrace, but I'll leave it to the
Solaris kernel engineers who wrote DTrace to explore it with me
in a series of questions and answers. I think that once you've heard
the engineers talk about DTrace, and once you've experienced it yourself,
you'll agree with me that it's a brilliant piece of work that greatly
improves our ability to understand the workings of Solaris.
Q&A with the DTrace Team
The following questions and answers were excerpted from an email
discussion with Bryan Cantrill, Michael Shapiro, and Adam Leventhal,
of Sun's Solaris Kernel Development team:
Q: How large was the team that worked on DTrace?
A: There were three engineers on the team -- we're big believers
in the power of small, highly motivated teams.
Q: What was the motivation for DTrace?
A: Bluntly, when the system misbehaved, we had only ad hoc techniques
for figuring out what the hell was going on. One incident is particularly
illustrative of this: in December 1997, we were called in on a benchmarking
crisis on what was then Sun's biggest machine -- a 64-processor
E10K.
When running the benchmark, the machine would periodically enter
a state where performance would drop off exponentially. The machine
would stay in this state for up to several minutes, after which
time it seemed to magically recover. Those running the benchmark
had determined that eliminating the "bad state" would allow us to
blow away the world record on this particular benchmark -- so there
was a tremendous amount of pressure within the company to get this
problem resolved.
We had the company's top experts on the problem, and we were working
around-the-clock using every ad hoc black art technique at our disposal
-- much of which consisted of adding custom instrumentation to kernel
modules to home in on the problem. Adding this instrumentation required
delicacy and precision -- and it required rebooting the system every
time we wanted to add more instrumentation. This system took almost
two hours to reboot(!), so this alone was a serious obstacle to
moving quickly on the problem. To wit: even with the top experts
at the company working on the problem, it took us nearly a week
to fully understand it. And here's the kicker: the problem turned
out to be a simple misconfiguration.
The system had been erroneously configured to be an IPv4 router
(i.e., neither /etc/notrouter nor /etc/defaultrouter existed), and
it would occasionally start to furiously route packets between two
other machines related to the benchmark. This (unnecessary) routing
activity became so substantial that it induced contention in the
IP layer of the networking stack, which in turn caused the benchmark
activity to drop to practically nil. Once /etc/defaultrouter was
properly set, the problem went away -- and the machine went on to
smash the world record on that particular benchmark.
While it was a relief to have found the problem, we emerged from
the experience resolved to find a better way to diagnose our systems.
By that time, we had already begun thinking about dynamic instrumentation,
but after this experience we became intent on seeing our ideas become
reality. In designing and implementing DTrace, we have always carried
this particular experience with us: we wanted to build the tool
that would have allowed us to immediately identify the root cause
of this kind of problem.
Q: How long has DTrace been under development?
A: It took us just shy of two years to implement DTrace as it
exists today in Solaris Express -- but we're still busily adding
enhancements to DTrace, so by the time Solaris 10 is generally available,
it will reflect that much more time of active DTrace development.
Q: As far as I know, there has never been a tool like DTrace in
any operating system. Did it have predecessors? If not, what were
your influences and inspirations in creating it?
A: There are no real predecessors per se -- which is not to say
that all of the ideas are new; for example, Barton Miller's work
on dynamic instrumentation at the University of Wisconsin was particularly
inspirational. While we were somewhat influenced by earlier work,
DTrace nevertheless represents a significant divergence from it.
While most of the previous work in this domain has been aimed at
understanding an application in development, DTrace is aimed explicitly
at understanding a system in production.
Q: Although DTrace is only available as part of Solaris 10, you've
already used it at client sites to solve production problems. How?
A: We have indeed used DTrace to solve performance problems (some
of them very serious!) on customer machines. But it's a bit tough
to walk you through it, because one typically uses DTrace many times
before one understands the root cause. Part of the power of DTrace
is that it allows quick iterative analysis: the output from one
DTrace enabling often raises a new question that leads to a new
DTrace enabling. It's very rare that one can walk up to a slow app,
run DTrace once, and have the answer -- so it's difficult to walk
you through a complete example without showing how the output of
one script prompted us to write the next.
That said, here's a relatively simple example that came up recently:
in a hallway conversation here in Menlo Park, an engineer tasked
with improving process start time wondered how many minor faults
a process typically takes between fork()'ing and exec()'ing.
The answer could guide improvements to the operating system. That
is, if the number of pages faulted in between a fork() and
an exec() were very small, it might be worth considering a scheme
in which individual page faults become somewhat more expensive in
exchange for reducing the cost of a fork() or exec(). Historically,
we would have answered this primarily via thought experiment --
and if we had wanted a better answer, we would have run truss on
a few random processes (using its -m fltpage option)
and crossed our fingers that the data was representative. And, if
we had wanted to know the exact answer, someone would have had to
generate a kernel with custom instrumentation -- an incredibly time-consuming
task requiring specialized expertise.
But with DTrace, we wrote the following script:
proc:::start
{
        /* a new process has started (after fork()): zero its fault count */
        self->pf = 0;
}

vminfo:::as_fault
{
        /* count each address-space fault taken by this thread */
        self->pf++;
}

proc:::exec
{
        /* at exec(), record the count in a histogram keyed by program name */
        @[execname] = lquantize(self->pf, 0, 100);
        self->pf = 0;
}
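A script like this is simply saved to a file and handed to dtrace(1M)
via its -s option (the file name here is just illustrative):

# dtrace -s minorfaults.d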
We ran this on our local production machines (NFS servers, build servers,
and the like). The output was a histogram of the number of pagefaults
between fork() and exec() for each application name.
The answer was somewhat surprising: most apps take ~30-45 page faults
between a fork() and an exec(). Here, for example, was
the histogram for ksh:
ksh
value ------------- Distribution ------------- count
28 | 0
29 |@@@ 2
30 | 0
31 |@@@@@@@@@@@@@ 8
32 |@@@ 2
33 |@@@@@@ 4
34 |@@@ 2
35 |@@@@@@@@ 5
36 | 0
This was higher than the engineer had anticipated, and the answer
immediately redirected his thinking -- saving Sun the cost of even
prototyping a solution to the wrong problem! The time from question
to answer was about five minutes, and the data was gathered from production
machines!
Here's another example that came up recently: as many people probably
know, some programs have reason at one time or another to close
all of their file descriptors. Many programs do this by simply iterating
over all possible file descriptors and closing each one, whether it's
open or not. Especially if RLIMIT_NOFILE is set high, this
can be painful. To allow applications to do this without quite so
much pain, we added closefrom(3C) to Solaris quite some time
ago. It recently occurred to someone to find applications that were
candidates for closefrom(3C), so we wrote the following D
script to find closes of long runs of file descriptors that return
EBADF:
#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::close:entry
{
        /* remember which file descriptor this thread is closing */
        self->arg = arg0;
}

syscall::close:return
/self->arg && errno == EBADF &&
    (self->arg == self->last + 1 ||
    self->arg == self->last - 1 || self->last == 0)/
{
        /* a failed close() of an adjacent descriptor extends the run */
        self->run++;
        self->last = self->arg;
        self->arg = 0;
}

syscall::close:return
/self->arg/
{
        /* any other close() stops the adjacency tracking */
        self->arg = 0;
        self->last = 0;
}

syscall:::entry
/!self->arg && self->run && self->run < 10/
{
        /* on the next traced syscall, discard runs shorter than 10 */
        self->run = 0;
        self->last = 0;
}

syscall:::entry
/!self->arg && self->run/
{
        /* record longer runs, keyed by program name, in buckets of 10 */
        @[execname] = lquantize(self->run, 0, 300, 10);
        self->run = 0;
        self->last = 0;
}

profile:::tick-1sec
{
        /* print the aggregation once per second */
        printa(@);
}
Running this on our local production mail server (1000+ users) yielded
the following output:
mutt
value ------------- Distribution ------------- count
230 | 0
240 |@@@@@@@@@@@@@@@@@@@@ 2
250 |@@@@@@@@@@@@@@@@@@@@ 2
260 | 0
slocal
value ------------- Distribution ------------- count
240 | 0
250 |@@@@@@@@@@@@@@@@@@@@ 11
260 | 0
sendmail
value ------------- Distribution ------------- count
230 | 0
240 |@@@@@@@@ 6
250 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 22
260 | 0
Sendmail has since been changed to use closefrom(3C); the others
are under investigation. (And running the script on a build server
revealed that "javac" is also a candidate application!)
This is a good example of simple information that was essentially
impossible to gather (or certainly impossible to gather easily) on
production machines prior to DTrace.
Q: DTrace comes with its own new language, "D". Was a new language
necessary?
A: We don't view "D" as a totally new language; instead it's the
companion language to C for dynamic tracing. Over 90% of D's syntax
is simply ANSI C. To this familiar and stable base we added a small
set of enhancements, which generally divide into two major categories:
a program structure suitable for a tracing request rather than a
traditional linear program structure, and a set of primitives for
more advanced data types, such as associative arrays and aggregations,
so that D programmers don't have to implement their own data structures.
A tracing program is unlike a traditional program in that rather
than proceeding in some order defined by the programmer, the program
acts as a sort of symbiotic entity that executes based on the activity
of one or more other programs that are being traced. For this reason,
the D program structure uses a set of clauses that match probes
rather than traditional functions that call one another.
Many tracing programs are also very short-lived, perhaps even
typed on the command line, as the user quickly asks a series of
questions specific to a particular problem being investigated. For
this reason, we wanted to give our users easy access to sophisticated
data structures such as associative arrays and first-class manipulation
of strings. Traditional programming languages would have added a
lot of unnecessary overhead to these tasks. So we took as our base
the language of UNIX, C, and created a companion optimized for use
in tracing activities.
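As a small illustration of that structure (this example is mine, not the
engineers'; the probe and variable names are just illustrative), a D program
is simply a set of clauses, each pairing a probe description with an optional
predicate and some actions, and an aggregation stands in for a hand-built
data structure:

syscall::read:entry
{
        /* thread-local variable: note when this read() began */
        self->ts = timestamp;
}

syscall::read:return
/self->ts/
{
        /* aggregate elapsed time per program name; no explicit data structure needed */
        @elapsed[execname] = quantize(timestamp - self->ts);
        self->ts = 0;
}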
Q: How can a tool have a userland compiler, create kernel triggers,
consume no resources when not in use, and still be secure and perform well?
A: By being designed very carefully to meet those requirements
:). Seriously, though, this question highlights some of the key
problems we specifically set out to solve with DTrace and that we
had to overcome in order to make DTrace effective for use on production
systems.
The DTrace compiler operates in userland because it simply converts
requests from the high-level D programming language into byte code
for a very small, simple virtual machine designed specifically
for DTrace. As instrumentation requests are loaded into the kernel,
DTrace performs a set of static safety checks before enabling the
instrumentation, and performs further dynamic safety checks as the
virtual machine instructions are emulated. These checks, combined
with the simplicity of the byte code itself, ensure that an errant
program, broken compiler, or malicious user cannot damage the system
using DTrace.
Our forthcoming USENIX paper describes the safety issues in further
detail. In addition to safety considerations, DTrace makes use of
the new Solaris 10 Least Privilege technology to offer three separate
privileges that can be used to control how much information a user
can access with DTrace. These privilege bits permit developers to
examine only their own processes, while systems administrators
can perform system-wide analysis. Together, DTrace's safety and
security mechanisms mean that DTrace is safe for use in production.
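For readers who want to experiment: the three privileges are dtrace_proc,
dtrace_user, and dtrace_kernel, and on a Solaris Express system an
administrator could grant a developer the first two with something like
the following, where the user name is purely illustrative:

# usermod -K defaultpriv=basic,dtrace_proc,dtrace_user devuser

See the privileges(5) man page and the DTrace manual for the details.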
DTrace consumes no resources when not in use because we made all
of its instrumentation dynamic: a system where DTrace is not being
used is effectively the same as a system where DTrace is not even
installed. Because all of the instrumentation is enabled dynamically
and discretely, and the D language allows you to express your questions
very concisely and filter data at the source, the overhead of applying
DTrace scales with your use and usually imposes a very low probe
effect. These benefits could not be achieved by previous static
tracing systems where a single global switch enabled tracing.
Q: What kinds of problems can DTrace help solve?
A: Just about anything that can be phrased as a question -- DTrace
allows you to answer pretty much any question you may have about
your system. For example, you might wish to know "what application
is causing all of the cross calls?" Answering this question -- which
has been historically very difficult to answer -- is as easy as
dtrace -n xcalls'{@[execname] = count()}'. And that's just
the beginning: often, one finds that the answer to one question
raises another question. With DTrace, it's very easy to quickly
iterate from one answer to the next question.
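To give a hypothetical flavor of that iteration (the process name here is
made up): if the one-liner above pointed at a process named foo, the natural
next enabling asks where foo's cross calls originate:

# dtrace -n xcalls'/execname == "foo"/{@[stack()] = count()}'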
Q: What kinds of things does DTrace not solve? For example, getting
knowledge of the performance of network interfaces, storage interfaces,
and system busses has been, ahem, a "challenge" in previous Solaris
releases. Does DTrace help there?
A: Absolutely. Take network interfaces, for example. As you alluded
to, getting network performance numbers has traditionally been very
difficult with Solaris. With DTrace, getting the answer to a question
like "what applications are sending what network traffic?" is easy:
# dtrace -n tcp_wput:entry'{@[execname] = quantize(msgdsize(args[1]))}'
For example, here is some output from running the above on a mail and NFS server:
save
value ------------- Distribution ------------- count
-1 | 0
0 | 5
1 | 0
2 | 0
4 | 0
8 | 0
16 | 0
32 | 1
64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 54958
128 | 71
256 | 30
512 | 10
1024 | 8
2048 | 23
4096 | 98
8192 | 189
16384 | 5
32768 | 0
imapd
value ------------- Distribution ------------- count
-1 | 0
0 | 1269
1 | 0
2 |@ 2483
4 | 18
8 |@ 2149
16 |@@@@ 5892
32 |@@@@@ 7927
64 | 764
128 | 565
256 |@@@@@@@@@@ 13508
512 |@@@@ 5999
1024 |@@@@@@@@@ 12201
2048 | 47
4096 | 28
8192 | 395
16384 | 0
nfsd
value ------------- Distribution ------------- count
-1 | 0
0 | 1765
1 | 0
2 | 0
4 | 0
8 | 135
16 | 181
32 |@@@@@@@@@@@ 182137
64 |@@@@@@@@@@@@@@@@@@@@@ 323422
128 |@@@@@ 84998
256 | 9420
512 | 723
1024 | 2170
2048 | 191
4096 | 245
8192 | 1116
16384 | 617
32768 | 1662
65536 | 0
This is roughly as we would expect, but one surprise is that the backup
software ("save") seems to be sending small packets. Why is it doing
that? Is that a failure mode of some sort, or is that normal operation?
And is there an opportunity for optimization here? It's a good example
of how one DTrace enabling often raises questions just as it answers
them, and it also raises another important point: by itself, DTrace
does not solve performance problems. Rather, DTrace is a tool
by which these problems can be understood. With DTrace, performance
analysis becomes much less labor intensive, but much more brain intensive
-- and a hell of a lot more fun!
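To sketch what such a follow-up enabling might look like (this is my
illustration, with an arbitrary size cutoff, not something the engineers
ran), one could aggregate on the user stacks behind just the small writes
from save:

# dtrace -n tcp_wput:entry'/execname == "save" && msgdsize(args[1]) <= 64/{@[ustack()] = count()}'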
Closing Comments
DTrace is certainly an impressive and powerful tool. I've been
experimenting quite a bit with it, and it appears to meet the analysis
and debugging needs of kernel developers, application developers,
and systems administrators alike. The manual is available as part
of Software Express for Solaris and is daunting at first glance.
However, it is very clearly written, in a tutorial-plus-reference
style. If you know C or a comparable programming language,
it won't take much time to get up to speed and start reaping the
benefits of this greatly expanded visibility into the workings of
Solaris and the processes running on it.
I thank Sun's Solaris Kernel Development team for taking the
time to provide this information about DTrace. For more information
about DTrace, see the manuals included with Software Express for Solaris. There
is also a discussion board about DTrace at:
http://www.sun.com/bigadmin/content/dtrace/
Also, the kernel engineers are giving a talk about DTrace at the upcoming
USENIX '04 conference in Boston (http://www.usenix.com/events/usenix04/).
I'll also be there teaching "Advanced Topics in Solaris Administration".
I hope to see you there.
Peter Baer Galvin (http://www.petergalvin.info) is the
Chief Technologist for Corporate Technologies (www.cptech.com),
a premier systems integrator and VAR. Before that, Peter was the
systems manager for Brown University's Computer Science Department.
He has written articles for Byte and other magazines, and
previously wrote Pete's Wicked World, the security column, and Pete's
Super Systems, the systems management column for Unix Insider
(http://www.unixinsider.com). Peter is coauthor of
the Operating System Concepts and Applied Operating
System Concepts textbooks. As a consultant and trainer, Peter
has taught tutorials and given talks on security and systems administration
worldwide. |