Observing
I/O Behavior with the DTraceToolkit
Ryan Matteson
With the recent release of the Solaris 10 Operating System, Sun
unleashed the Dynamic Tracing Facility (DTrace) on the world. Dynamic
tracing can be used by administrators, developers, and quality engineering
to track down reliability and performance problems, correlate disparate
events, and observe applications and the Solaris kernel in ways
that have never before been possible.
The DTrace framework works by integrating several providers into
the Solaris kernel. Each provider contains a set of probes, which
are locations in the kernel where actions (e.g., capture a stack
trace, record an argument to a function, etc.) can be applied when
the probe is triggered. Probes are enabled programmatically through
libdtrace or through the dtrace(1m) command-line utility,
which is preferred.
DTrace comes bundled with a language called "D". D is a scripting
language that is well-suited for developing tools to observe application
resources and system behavior. The DTraceToolkit is a collection
of D scripts that includes tools to view the virtual memory subsystem,
network utilization, CPU activity, locking behavior, and the I/O
subsystem. In this article, I'll introduce the DTraceToolkit and
demonstrate how to use the Toolkit to analyze I/O behavior on a
Solaris server.
DTraceToolkit Installation
The DTraceToolkit is available as a compressed tar archive and
can be installed to your favorite directory location with the wget
and gtar utilities:
$ wget -q -O - http://www.brendangregg.com/ \
DTraceToolkit-latest.tar.gz | gtar xz
All of the scripts in the DTraceToolkit are organized into directories
that identify their purpose (e.g., Disk, Memory, etc.) and are linked
into a Bin directory for easy access:
$ ls -la
total 304
drwxr-xr-x 34 matty matty 1156 Jul 26 01:51 .
drwx------ 12 matty matty 408 Aug 7 21:23 ..
drwxr-xr-x 84 matty matty 2856 Jul 26 01:43 Bin
drwxr-xr-x 9 matty matty 306 Jun 28 00:24 Cpu
drwxr-xr-x 9 matty matty 306 Jul 25 06:40 Disk
drwxr-xr-x 16 matty matty 544 Jul 26 01:44 Docs
drwxr-xr-x 3 matty matty 102 Jun 11 08:52 Extra
-rw-r--r-- 1 matty matty 2392 Jul 26 01:43 Guide
drwxr-xr-x 9 matty matty 306 Jun 13 15:32 Kernel
lrwxr-xr-x 1 matty matty 14 Aug 7 21:23 License -> Docs/cddl1.txt
drwxr-xr-x 4 matty matty 136 May 16 03:32 Locks
drwxr-xr-x 3 matty matty 102 Jun 13 08:36 Man
drwxr-xr-x 12 matty matty 408 Jul 25 02:57 Mem
drwxr-xr-x 10 matty matty 340 Jul 25 09:06 Net
drwxr-xr-x 25 matty matty 850 Jul 25 13:36 Proc
lrwxr-xr-x 1 matty matty 5 Aug 7 21:23 README -> Guide
drwxr-xr-x 7 matty matty 238 Jul 25 11:05 System
drwxr-xr-x 4 matty matty 136 May 15 04:16 User
-rw-r--r-- 1 matty matty 40 Jul 26 01:51 Version
drwxr-xr-x 4 matty matty 136 Jul 26 01:41 Zones
-rwxr-xr-x 1 matty matty 7143 Jul 23 07:51 dappprof
-rwxr-xr-x 1 matty matty 7626 Jul 23 07:52 dapptrace
-rwxr-xr-x 1 matty matty 13576 Jul 23 07:56 dexplorer
-rwxr-xr-x 1 matty matty 13461 Jul 23 08:02 dtruss
-rwxr-xr-x 1 matty matty 6224 Jul 23 08:03 dvmstat
-rwxr-xr-x 1 matty matty 4356 Jul 23 08:04 errinfo
-rwxr-xr-x 1 matty matty 4883 Jul 23 08:05 execsnoop
-rwxr-xr-x 1 matty matty 4275 May 15 01:03 install
-rwxr-xr-x 1 matty matty 11983 Jul 25 07:26 iosnoop
-rwxr-xr-x 1 matty matty 10954 Jul 26 00:56 iotop
-rwxr-xr-x 1 matty matty 6189 Jul 23 08:34 opensnoop
-rwxr-xr-x 1 matty matty 6043 Jul 23 08:49 procsystime
-rwxr-xr-x 1 matty matty 5483 Jul 23 21:46 rwsnoop
-rwxr-xr-x 1 matty matty 6514 Jul 23 14:51 rwtop
A large number of the scripts contain a "-h" (help) option, which
prints a short description of the available flags and options:
$ iopattern -h
USAGE: iopattern [-v] [-d device] [-f filename] [-m mount_point]
[interval [count]]
-v # print timestamp
-d device # instance name to snoop
-f filename # snoop this file only
-m mount_point # this FS only
eg,
iopattern # default output, 1 second samples
iopattern 10 # 10 second samples
iopattern 5 12 # print 12 x 5 second samples
iopattern -m / # snoop events on filesystem / only
The README file in the top-level directory provides a description
of each directory, and the Docs/Contents file contains a description
of each script. Now that we know how the toolkit is laid out, let's
put this bad boy to work!
Monitoring Physical I/O Activity
One of the biggest challenges we face as systems administrators
is identifying I/O performance problems. I/O problems are often
caused by improper tuning, over-utilized disk resources, and memory
shortages. When I/O performance problems surface, they usually lead
to application latency and reduced database throughput. The DTraceToolkit
comes with the iosnoop script, which can be used to monitor physical
I/O operations, measure I/O latency, and correlate physical I/O
activity to individual processes:
$ iosnoop -e
DEVICE UID PID D BLOCK SIZE COMM PATHNAME
dad1 100 522 R 2821088 8192 bash /usr/bin/su
md0 100 522 R 2821088 8192 bash /usr/bin/su
dad0 100 522 R 332144 8192 su /lib/libbsm.so.1
md0 100 522 R 332144 8192 su /lib/libbsm.so.1
dad1 100 522 R 288484 6144 su /lib/libmd5.so.1
md0 100 522 R 288484 6144 su /lib/libmd5.so.1
dad1 100 522 R 334944 8192 su /lib/libmp.so.2
md0 100 522 R 334944 8192 su /lib/libmp.so.2
dad1 25 527 R 10114400 8192 sendmail /usr/lib/sendmail
md0 25 527 R 10114400 8192 sendmail /usr/lib/sendmail
dad0 25 527 R 10114160 8192 sendmail /usr/lib/sendmail
md0 25 527 R 10114160 8192 sendmail /usr/lib/sendmail
dad1 25 527 R 10114448 8192 sendmail /usr/lib/sendmail
md0 25 527 R 10114448 8192 sendmail /usr/lib/sendmail
When the "-e" (print device name) is passed to iosnoop, the device
name to which the I/O operation is targeted will be printed in column
1. The second and third columns contain the uid and process id of
the process that issued the read or write operation. The fourth column
contains an "R" for a read operation or a "W" for a write operation.
The fifth column displays the starting logical block in the file system
where data is being read or written, and the sixth column displays
the size of the read or write operation. The seventh column displays
the name of the executable that performed the read or write operation,
and the last column contains the file name that is being read or written
from.
When iosnoop is invoked with the "-o" (print disk delta time,
us) option, the number of microseconds required to service each
physical read and write operation will be displayed in the "DTIME"
column:
$ iosnoop -e -o
DEVICE DTIME UID PID D BLOCK SIZE COMM PATHNAME
sd5 5555 101 377 W 6927040 8192 oracle /u03/oradata/proddb/redo03.log
sd6 8618 101 377 W 6927040 8192 oracle /u03/oradata/proddb/redo03.log
md120 10967 101 377 W 6927040 8192 oracle /u03/oradata/proddb/redo03.log
sd1 12300 101 460 R 14771248 8192 oracle /u01/app/oracle/product/ \
10.0.2.0/bin/oracle
sd2 12712 101 460 R 14771248 8192 oracle /u01/app/oracle/product/ \
10.0.2.0/bin/oracle
sd3 9229 101 460 W 21360480 8192 oracle /u02/oradata/proddb/temp01.dbf
sd4 12442 101 460 W 21360480 8192 oracle /u02/oradata/proddb/temp01.dbf
md110 13751 101 460 W 21360480 8192 oracle /u02/oradata/proddb/temp01.dbf
The iosnoop script will continuously display physical read and write
operations for all processes on a system, which can often overwhelm
a terminal screen on a busy server. To limit the output to a single
process (possibly one that is causing problems), the process name
can be passed to iosnoop's "-n" (this process name only) option:
$ iosnoop -a -n sshd
STRTIME DEVICE MAJ MIN UID PID D BLOCK SIZE PATHNAME ARGS
2005 Aug 28 16:41:50 dad1 136 8 0 750 W 3435040 3072 /var/adm/lastlog
/usr/lib/ssh/sshd
2005 Aug 28 16:41:50 dad1 136 8 0 750 W 2993203 512 <none>
/usr/lib/ssh/sshd
2005 Aug 28 16:41:50 dad1 136 8 0 750 W 5259776 8192 /var/adm/wtmpx
/usr/lib/ssh/sshd
2005 Aug 28 16:41:57 dad1 136 8 0 761 W 3435040 3072 /var/adm/lastlog
/usr/lib/ssh/sshd
2005 Aug 28 16:41:57 dad1 136 8 0 761 W 2993204 9216 <none>
/usr/lib/ssh/sshd
2005 Aug 28 16:42:02 dad1 136 8 0 770 W 3435040 3072 /var/adm/lastlog
/usr/lib/ssh/sshd
2005 Aug 28 16:42:02 dad1 136 8 0 770 W 2993222 1024 <none>
/usr/lib/ssh/sshd
2005 Aug 28 16:42:08 dad1 136 8 0 692 W 2832720 4096 /var/adm/utmpx
/usr/lib/ssh/sshd
The continuously scrolling display is useful for capturing data over
long periods of time. When a point-in-time snapshot is required, the
DTraceToolkit's iotop script can be used to view physical I/O activity
in a "top"-like display:
$ iotop 2
2005 Aug 13 11:36:43, load: 0.16, disk_r: 96 Kb, disk_w: 109890 Kb
UID PID PPID CMD DEVICE MAJ MIN D BYTES
0 0 0 sched dad0 136 7 W 512
0 3 0 fsflush dad0 136 7 W 512
0 3 0 fsflush dad1 136 15 W 512
0 0 0 sched dad1 136 15 W 512
0 3 0 fsflush dad1 136 8 W 3072
0 3 0 fsflush md0 85 0 W 3072
0 3 0 fsflush dad0 136 0 W 3072
100 502 484 mkfile dad0 136 7 W 3584
100 502 484 mkfile dad1 136 15 W 3584
0 501 498 dtrace dad1 136 8 R 8192
0 501 498 dtrace dad0 136 0 R 8192
0 501 498 dtrace md0 85 0 R 16384
100 502 484 mkfile dad1 136 8 R 16384
100 502 484 mkfile dad0 136 0 R 16384
100 502 484 mkfile md0 85 0 R 32768
100 502 484 mkfile md0 85 0 W 34881536
100 502 484 mkfile dad1 136 8 W 34881536
100 502 484 mkfile dad0 136 0 W 34881536
The output of iotop will be refreshed every five seconds by default.
To adjust this interval, a positive integer value can be passed as
an option to iotop. Because the iosnoop and iotop scripts were designed
to monitor physical I/O activity, reads serviced by the page cache
and operations targeted at pseudo-devices will not be displayed by
iosnoop and iotop. The section "Monitoring Read and Write Activity"
will describe two additional scripts that can be used to provide a
holistic view of read and write activity, which can be useful for
viewing accesses to pseudo-devices and cached data.
Monitoring Read and Write Activity
When a process needs to read from or write to a file or device,
the process will invoke one of the kernel's read or write system
calls. The read or write may occur to a file, pseudo-device, or
to one of the physical devices attached to the system. The rwsnoop
script can capture read and write activity, which can be helpful
for monitoring I/O activity when file system caches are in use and
to view accesses to pseudo-devices and virtual file systems that
reside in memory. To view read and write operations along with the
process that generated the request, the rwsnoop script can be invoked:
$ rwsnoop
UID PID CMD D BYTES FILE
101 379 oracle W 16384 /u01/oradata/proddb/control01.ctl
101 379 oracle W 16384 /u01/oradata/proddb/control02.ctl
101 379 oracle W 16384 /u01/oradata/proddb/control03.ctl
101 460 oracle W 1048576 /u02/oradata/proddb/tbs01.dbf
101 460 oracle W 1048576 /u02/oradata/proddb/rbs01.dbf
101 460 oracle W 1048576 /u02/oradata/proddb/tbs01.dbf
101 377 oracle W 1024 /u03/oradata/proddb/redo03.log
101 377 oracle W 1024 /u03/oradata/proddb/redo03.log
101 371 oracle R 416 /proc/371/psinfo
101 371 oracle R 416 /proc/373/psinfo
101 371 oracle R 416 /proc/375/psinfo
Rwsnoop will display which process is performing the read or write
operation, the size and type (read or write) of that operation, and
the file or device the process is reading or writing to. Like iosnoop,
the output will be continuously displayed on the terminal screen.
To view a summary of read and write activity in a "top"-like display,
the rwtop script can be used:
$ rwtop 2
2005 Jul 24 10:47:26, load: 0.18, app_r: 9 Kb, app_w: 8 Kb
UID PID PPID CMD D BYTES
100 922 920 bash R 3
100 922 920 bash W 15
100 902 899 sshd R 1223
100 926 922 ls R 1267
100 902 899 sshd W 1344
100 926 922 ls W 2742
100 920 917 sshd R 2946
100 920 917 sshd W 4819
0 404 1 vxsvc R 5120
The screen will be refreshed every five seconds by default, which
can be adjusted by passing a new refresh interval to rwtop.
I/O Size Distributions
As systems administrators, we are often asked to identify the
size of I/O operations issued by each application. This information
is used by systems and application administrators to tune file system
block and cluster sizes, SCSI and LVM transfer sizes, and RAID 0
stripe widths to match application I/O patterns.
Before the introduction of the DTraceToolkit, the vxtrace(1m),
prex(1m), and truss(1m) utilities were the best methods
to find this information. The output from these utilities was often
extremely verbose, and the process of correlating data and determining
I/O size distributions was cumbersome and prone to error.
Now that the DTraceToolkit is available, this information can
be easily determined by running the bitesize.d script. bitesize.d
will determine I/O size distributions for all processes in a Solaris
system and will print a histogram with the number of I/O operations
displayed by size:
$ bitesize.d
Sampling... Hit Ctrl-C to end.
377 ora_lgwr_proddb
value ------------- Distribution ------------- count
256 | 0
512 |@@@@@@@@@@@@ 604
1024 | 2
2048 | 0
4096 | 2
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@ 1218
16384 |@@ 102
32768 | 0
65536 | 0
131072 | 12
262144 | 0
7312 dbload pages=128
value ------------- Distribution ------------- count
16 | 0
32 | 2
64 | 0
128 | 0
256 | 0
512 | 2
1024 | 0
2048 | 0
4096 | 0
8192 | 0
16384 | 0
32768 | 0
65536 | 0
131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 76947
262144 | 0
This display shows the output from the Oracle logwriter and a database
load process. The histogram shows that the Oracle log writer was using
predominantly 8k-16k I/O sizes. The database loader, on the other
hand, was requesting 1-MB chunks (128 pages * 8k per page), but we
can see from this display that the majority of the I/O operations
were 128k-256k in size. This was caused by the md_maxphys value, which
is the largest default I/O size that can be performed to an SVM volume.
I/O Access Patterns
When deploying new applications to Solaris systems, it is essential
to understand how an application accesses data. Access patterns
can be random or sequential and often vary in size. Understanding
access patterns can help when tuning direct I/O and caching attributes
and are invaluable for finding the optimal block sizes for applications.
Before the introduction of DTrace, an administrator was required
to discuss access patterns with vendors and development teams or
parse through mounds of truss(1m) and vxtrace(1m)
data to estimate how applications access files and volumes. The
data was often skewed, required a good deal of time to derive interesting
metrics, and was often prone to error.
Now that DTrace is able to capture data throughout the kernel,
the job of finding access patterns has been greatly simplified.
The DTraceToolkit includes the iopattern and seeksize.d scripts,
which can both be used to display per-process and system-wide access
patterns. To get system-wide access patterns, the iopattern script
can be used:
$ iopattern
%RAN %SEQ COUNT MIN MAX AVG KR KW
95 5 101 512 131072 30649 2792 231
60 40 174 512 131072 61557 951 9509
58 42 187 512 131072 35369 2232 4227
53 47 229 512 131072 65880 3992 10741
42 58 170 512 131072 89798 5328 9580
67 33 190 512 131072 52135 3864 5809
80 20 195 512 131072 6516 1207 34
84 16 224 512 24576 7860 1574 145
89 11 199 1024 8192 7682 1481 12
Iopattern provides the percentage of random and sequential I/O, the
number and total size of the I/O operations performed during this
sample period, and it provides the minimum, maximum, and average I/O
sizes. To get the I/O distribution for each process, the seeksize.d
script can be used:
$ seeksize.d
Sampling... Hit Ctrl-C to end.
^C
PID CMD
7312 dd if=/dev/dsk/c1t1d0s2 of=/dev/null bs=1048576
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1762
1 | 0
PID CMD
377 ora_dbw0_proddb
value ------------- Distribution ------------- count
-8388608 | 0
-4194304 | 3
-2097152 | 28
-1048576 | 44
-524288 | 95
-262144 | 102
-131072 | 166
-65536 |@ 363
-32768 |@@ 585
-16384 |@ 328
-8192 | 141
-4096 | 105
-2048 | 57
-1024 | 35
-512 | 34
-256 | 40
-128 | 31
-64 | 35
-32 | 47
-16 | 22
-8 | 6
-4 | 20
-2 | 0
-1 | 0
0 |@@@@@@@@@@@@ 4164
1 | 0
2 |@@@ 926
4 |@ 208
8 |@@@@@@ 2148
16 |@ 440
32 |@ 410
64 |@ 318
128 |@ 225
256 |@ 190
512 | 148
1024 | 139
2048 | 155
4096 | 168
8192 |@ 183
16384 |@ 377
32768 |@@ 657
65536 |@ 406
131072 |@ 199
262144 | 151
524288 | 111
1048576 | 37
2097152 | 14
4194304 | 3
8388608 | 0
Seeksize.d measures the seek distance between consecutive reads and
writes and provides a histogram with the number of sectors traveled
between consecutive read or write operations. Applications that read
or write data in a sequential fashion will contain a small distribution
(e.g., dd in the example above). Applications that read or write data
in a random fashion will cause the disk heads to seek between read
and write operations, which will cause a wide distribution (e.g.,
the Oracle database writer in the example above) in the seeksize.d
output.
Conclusions
With the continued growth in storage capacity and complexity,
it is important to have tools to analyze and debug I/O problems.
This article introduced several scripts in the DTraceToolkit that
can be used to easily observe I/O behavior in ways that were previously
difficult or impossible. Because Solaris 10, DTrace, and the DTraceToolkit
are free to download and use, truly understanding your applications
is just a few clicks away.
References
DTrace Documentation -- http://www.sun.com/bigadmin/content/dtrace
DTraceToolkit -- http://www.opensolaris.org/os/community/dtrace/dtracetoolkit
Solaris 10 -- http://www.sun.com/download
Acknowledgments
Ryan thanks Brendan Gregg for reviewing this article and writing
the DTraceToolkit. Ryan would also like to thank the DTrace developers
for their awesome contribution to system and application observation.
Ryan Matteson works as a systems engineer, and specializes
in Web technologies, SANs, and the OpenBSD, Linux, and Solaris operating
systems. When Ryan isn't busy working, he enjoys playing guitar
and maintaining his blog at: daemons.net/~matty. Questions
and comments about this article can be addressed to: matty@daemons.net. |