Observing I/O Behavior with the DTraceToolkit

Ryan Matteson

With the recent release of the Solaris 10 Operating System, Sun unleashed the Dynamic Tracing Facility (DTrace) on the world. Dynamic tracing can be used by administrators, developers, and quality engineering to track down reliability and performance problems, correlate disparate events, and observe applications and the Solaris kernel in ways that have never before been possible.

The DTrace framework works by integrating several providers into the Solaris kernel. Each provider contains a set of probes, which are locations in the kernel where actions (e.g., capture a stack trace, record an argument to a function, etc.) can be applied when the probe is triggered. Probes are enabled programmatically through libdtrace or through the dtrace(1m) command-line utility, which is preferred.

DTrace comes bundled with a language called "D". D is a scripting language that is well-suited for developing tools to observe application resources and system behavior. The DTraceToolkit is a collection of D scripts that includes tools to view the virtual memory subsystem, network utilization, CPU activity, locking behavior, and the I/O subsystem. In this article, I'll introduce the DTraceToolkit and demonstrate how to use the Toolkit to analyze I/O behavior on a Solaris server.

DTraceToolkit Installation

The DTraceToolkit is available as a compressed tar archive and can be installed to your favorite directory location with the wget and gtar utilities:

$ wget -q -O - http://www.brendangregg.com/ \
  DTraceToolkit-latest.tar.gz | gtar xz

All of the scripts in the DTraceToolkit are organized into directories that identify their purpose (e.g., Disk, Memory, etc.) and are linked into a Bin directory for easy access:

$ ls -la
total 304
drwxr-xr-x   34 matty  matty   1156 Jul 26 01:51 .
drwx------   12 matty  matty    408 Aug  7 21:23 ..
drwxr-xr-x   84 matty  matty   2856 Jul 26 01:43 Bin
drwxr-xr-x    9 matty  matty    306 Jun 28 00:24 Cpu
drwxr-xr-x    9 matty  matty    306 Jul 25 06:40 Disk
drwxr-xr-x   16 matty  matty    544 Jul 26 01:44 Docs
drwxr-xr-x    3 matty  matty    102 Jun 11 08:52 Extra
-rw-r--r--    1 matty  matty   2392 Jul 26 01:43 Guide
drwxr-xr-x    9 matty  matty    306 Jun 13 15:32 Kernel
lrwxr-xr-x    1 matty  matty     14 Aug  7 21:23 License -> Docs/cddl1.txt
drwxr-xr-x    4 matty  matty    136 May 16 03:32 Locks
drwxr-xr-x    3 matty  matty    102 Jun 13 08:36 Man
drwxr-xr-x   12 matty  matty    408 Jul 25 02:57 Mem
drwxr-xr-x   10 matty  matty    340 Jul 25 09:06 Net
drwxr-xr-x   25 matty  matty    850 Jul 25 13:36 Proc
lrwxr-xr-x    1 matty  matty      5 Aug  7 21:23 README -> Guide
drwxr-xr-x    7 matty  matty    238 Jul 25 11:05 System
drwxr-xr-x    4 matty  matty    136 May 15 04:16 User
-rw-r--r--    1 matty  matty     40 Jul 26 01:51 Version
drwxr-xr-x    4 matty  matty    136 Jul 26 01:41 Zones
-rwxr-xr-x    1 matty  matty   7143 Jul 23 07:51 dappprof
-rwxr-xr-x    1 matty  matty   7626 Jul 23 07:52 dapptrace
-rwxr-xr-x    1 matty  matty  13576 Jul 23 07:56 dexplorer
-rwxr-xr-x    1 matty  matty  13461 Jul 23 08:02 dtruss
-rwxr-xr-x    1 matty  matty   6224 Jul 23 08:03 dvmstat
-rwxr-xr-x    1 matty  matty   4356 Jul 23 08:04 errinfo
-rwxr-xr-x    1 matty  matty   4883 Jul 23 08:05 execsnoop
-rwxr-xr-x    1 matty  matty   4275 May 15 01:03 install
-rwxr-xr-x    1 matty  matty  11983 Jul 25 07:26 iosnoop
-rwxr-xr-x    1 matty  matty  10954 Jul 26 00:56 iotop
-rwxr-xr-x    1 matty  matty   6189 Jul 23 08:34 opensnoop
-rwxr-xr-x    1 matty  matty   6043 Jul 23 08:49 procsystime
-rwxr-xr-x    1 matty  matty   5483 Jul 23 21:46 rwsnoop
-rwxr-xr-x    1 matty  matty   6514 Jul 23 14:51 rwtop

A large number of the scripts contain a "-h" (help) option, which prints a short description of the available flags and options:

$ iopattern -h
USAGE: iopattern [-v] [-d device] [-f filename] [-m mount_point]
                 [interval [count]]
 
                -v              # print timestamp
                -d device       # instance name to snoop
                -f filename     # snoop this file only
                -m mount_point  # this FS only
   eg,
        iopattern         # default output, 1 second samples
        iopattern 10      # 10 second samples
        iopattern 5 12    # print 12 x 5 second samples
        iopattern -m /    # snoop events on filesystem / only

The README file in the top-level directory provides a description of each directory, and the Docs/Contents file contains a description of each script. Now that we know how the toolkit is laid out, let's put this bad boy to work!

Monitoring Physical I/O Activity

One of the biggest challenges we face as systems administrators is identifying I/O performance problems. I/O problems are often caused by improper tuning, over-utilized disk resources, and memory shortages. When I/O performance problems surface, they usually lead to application latency and reduced database throughput. The DTraceToolkit comes with the iosnoop script, which can be used to monitor physical I/O operations, measure I/O latency, and correlate physical I/O activity to individual processes:

$ iosnoop -e
DEVICE    UID   PID D    BLOCK   SIZE       COMM PATHNAME
dad1      100   522 R  2821088   8192       bash /usr/bin/su
md0       100   522 R  2821088   8192       bash /usr/bin/su
dad0      100   522 R   332144   8192         su /lib/libbsm.so.1
md0       100   522 R   332144   8192         su /lib/libbsm.so.1
dad1      100   522 R   288484   6144         su /lib/libmd5.so.1
md0       100   522 R   288484   6144         su /lib/libmd5.so.1
dad1      100   522 R   334944   8192         su /lib/libmp.so.2
md0       100   522 R   334944   8192         su /lib/libmp.so.2
dad1       25   527 R 10114400   8192   sendmail /usr/lib/sendmail
md0        25   527 R 10114400   8192   sendmail /usr/lib/sendmail
dad0       25   527 R 10114160   8192   sendmail /usr/lib/sendmail
md0        25   527 R 10114160   8192   sendmail /usr/lib/sendmail
dad1       25   527 R 10114448   8192   sendmail /usr/lib/sendmail
md0        25   527 R 10114448   8192   sendmail /usr/lib/sendmail

When the "-e" (print device name) is passed to iosnoop, the device name to which the I/O operation is targeted will be printed in column 1. The second and third columns contain the uid and process id of the process that issued the read or write operation. The fourth column contains an "R" for a read operation or a "W" for a write operation. The fifth column displays the starting logical block in the file system where data is being read or written, and the sixth column displays the size of the read or write operation. The seventh column displays the name of the executable that performed the read or write operation, and the last column contains the file name that is being read or written from.

When iosnoop is invoked with the "-o" (print disk delta time, us) option, the number of microseconds required to service each physical read and write operation will be displayed in the "DTIME" column:

$ iosnoop -e -o
DEVICE DTIME UID  PID D  BLOCK    SIZE  COMM PATHNAME
sd5    5555  101  377 W  6927040  8192  oracle /u03/oradata/proddb/redo03.log
sd6    8618  101  377 W  6927040  8192  oracle /u03/oradata/proddb/redo03.log
md120 10967  101  377 W  6927040  8192  oracle /u03/oradata/proddb/redo03.log
sd1   12300  101  460 R 14771248  8192  oracle /u01/app/oracle/product/ \
                                        10.0.2.0/bin/oracle
sd2   12712  101  460 R 14771248  8192  oracle /u01/app/oracle/product/ \
                                        10.0.2.0/bin/oracle
sd3    9229  101  460 W 21360480  8192  oracle /u02/oradata/proddb/temp01.dbf
sd4   12442  101  460 W 21360480  8192  oracle /u02/oradata/proddb/temp01.dbf
md110 13751  101  460 W 21360480  8192  oracle /u02/oradata/proddb/temp01.dbf

The iosnoop script will continuously display physical read and write operations for all processes on a system, which can often overwhelm a terminal screen on a busy server. To limit the output to a single process (possibly one that is causing problems), the process name can be passed to iosnoop's "-n" (this process name only) option:

$ iosnoop -a -n sshd
STRTIME             DEVICE MAJ MIN UID  PID D   BLOCK   SIZE PATHNAME ARGS
2005 Aug 28 16:41:50 dad1  136   8   0  750 W  3435040  3072 /var/adm/lastlog
                                                             /usr/lib/ssh/sshd
2005 Aug 28 16:41:50 dad1  136   8   0  750 W  2993203   512 <none>
                                                             /usr/lib/ssh/sshd
2005 Aug 28 16:41:50 dad1  136   8   0  750 W  5259776  8192 /var/adm/wtmpx
                                                             /usr/lib/ssh/sshd
2005 Aug 28 16:41:57 dad1  136   8   0  761 W  3435040  3072 /var/adm/lastlog
                                                             /usr/lib/ssh/sshd
2005 Aug 28 16:41:57 dad1  136   8   0  761 W  2993204  9216 <none>
                                                             /usr/lib/ssh/sshd
2005 Aug 28 16:42:02 dad1  136   8   0  770 W  3435040  3072 /var/adm/lastlog
                                                             /usr/lib/ssh/sshd
2005 Aug 28 16:42:02 dad1  136   8   0  770 W  2993222  1024 <none>
                                                             /usr/lib/ssh/sshd
2005 Aug 28 16:42:08 dad1  136   8   0  692 W  2832720  4096 /var/adm/utmpx
                                                             /usr/lib/ssh/sshd

The continuously scrolling display is useful for capturing data over long periods of time. When a point-in-time snapshot is required, the DTraceToolkit's iotop script can be used to view physical I/O activity in a "top"-like display:

$ iotop 2
2005 Aug 13 11:36:43,  load: 0.16,  disk_r:   96 Kb,  disk_w: 109890 Kb

  UID    PID   PPID CMD           DEVICE   MAJ MIN D         BYTES
    0      0      0 sched          dad0    136   7 W           512
    0      3      0 fsflush        dad0    136   7 W           512
    0      3      0 fsflush        dad1    136  15 W           512
    0      0      0 sched          dad1    136  15 W           512
    0      3      0 fsflush        dad1    136   8 W          3072
    0      3      0 fsflush         md0     85   0 W          3072
    0      3      0 fsflush        dad0    136   0 W          3072
  100    502    484 mkfile         dad0    136   7 W          3584
  100    502    484 mkfile         dad1    136  15 W          3584
    0    501    498 dtrace         dad1    136   8 R          8192
    0    501    498 dtrace         dad0    136   0 R          8192
    0    501    498 dtrace          md0     85   0 R         16384
  100    502    484 mkfile         dad1    136   8 R         16384
  100    502    484 mkfile         dad0    136   0 R         16384
  100    502    484 mkfile          md0     85   0 R         32768
  100    502    484 mkfile          md0     85   0 W      34881536
  100    502    484 mkfile         dad1    136   8 W      34881536
  100    502    484 mkfile         dad0    136   0 W      34881536

The output of iotop will be refreshed every five seconds by default. To adjust this interval, a positive integer value can be passed as an option to iotop. Because the iosnoop and iotop scripts were designed to monitor physical I/O activity, reads serviced by the page cache and operations targeted at pseudo-devices will not be displayed by iosnoop and iotop. The section "Monitoring Read and Write Activity" will describe two additional scripts that can be used to provide a holistic view of read and write activity, which can be useful for viewing accesses to pseudo-devices and cached data.

Monitoring Read and Write Activity

When a process needs to read from or write to a file or device, the process will invoke one of the kernel's read or write system calls. The read or write may occur to a file, pseudo-device, or to one of the physical devices attached to the system. The rwsnoop script can capture read and write activity, which can be helpful for monitoring I/O activity when file system caches are in use and to view accesses to pseudo-devices and virtual file systems that reside in memory. To view read and write operations along with the process that generated the request, the rwsnoop script can be invoked:

$ rwsnoop
  UID    PID CMD         D    BYTES  FILE
  101    379 oracle      W    16384  /u01/oradata/proddb/control01.ctl
  101    379 oracle      W    16384  /u01/oradata/proddb/control02.ctl
  101    379 oracle      W    16384  /u01/oradata/proddb/control03.ctl
  101    460 oracle      W  1048576  /u02/oradata/proddb/tbs01.dbf
  101    460 oracle      W  1048576  /u02/oradata/proddb/rbs01.dbf
  101    460 oracle      W  1048576  /u02/oradata/proddb/tbs01.dbf
  101    377 oracle      W     1024  /u03/oradata/proddb/redo03.log
  101    377 oracle      W     1024  /u03/oradata/proddb/redo03.log
  101    371 oracle      R      416  /proc/371/psinfo
  101    371 oracle      R      416  /proc/373/psinfo
  101    371 oracle      R      416  /proc/375/psinfo

Rwsnoop will display which process is performing the read or write operation, the size and type (read or write) of that operation, and the file or device the process is reading or writing to. Like iosnoop, the output will be continuously displayed on the terminal screen. To view a summary of read and write activity in a "top"-like display, the rwtop script can be used:

$ rwtop 2
2005 Jul 24 10:47:26,  load: 0.18,  app_r:      9 Kb,  app_w:    8 Kb

UID    PID   PPID CMD              D            BYTES
100    922    920 bash             R                3
100    922    920 bash             W               15
100    902    899 sshd             R             1223
100    926    922 ls               R             1267
100    902    899 sshd             W             1344
100    926    922 ls               W             2742
100    920    917 sshd             R             2946
100    920    917 sshd             W             4819
  0    404      1 vxsvc            R             5120

The screen will be refreshed every five seconds by default, which can be adjusted by passing a new refresh interval to rwtop.

I/O Size Distributions

As systems administrators, we are often asked to identify the size of I/O operations issued by each application. This information is used by systems and application administrators to tune file system block and cluster sizes, SCSI and LVM transfer sizes, and RAID 0 stripe widths to match application I/O patterns.

Before the introduction of the DTraceToolkit, the vxtrace(1m), prex(1m), and truss(1m) utilities were the best methods to find this information. The output from these utilities was often extremely verbose, and the process of correlating data and determining I/O size distributions was cumbersome and prone to error.

Now that the DTraceToolkit is available, this information can be easily determined by running the bitesize.d script. bitesize.d will determine I/O size distributions for all processes in a Solaris system and will print a histogram with the number of I/O operations displayed by size:

$ bitesize.d
  Sampling... Hit Ctrl-C to end.

  377  ora_lgwr_proddb

         value  ------------- Distribution ------------- count    
           256 |                                         0        
           512 |@@@@@@@@@@@@                             604      
          1024 |                                         2        
          2048 |                                         0        
          4096 |                                         2        
          8192 |@@@@@@@@@@@@@@@@@@@@@@@@@                1218     
         16384 |@@                                       102      
         32768 |                                         0        
         65536 |                                         0        
        131072 |                                         12       
        262144 |                                         0   

  7312  dbload pages=128

        value  ------------- Distribution ------------- count
            16 |                                         0
            32 |                                         2
            64 |                                         0
           128 |                                         0
           256 |                                         0
           512 |                                         2
          1024 |                                         0
          2048 |                                         0
          4096 |                                         0
          8192 |                                         0
         16384 |                                         0
         32768 |                                         0
         65536 |                                         0
        131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 76947
        262144 |                                         0

This display shows the output from the Oracle logwriter and a database load process. The histogram shows that the Oracle log writer was using predominantly 8k-16k I/O sizes. The database loader, on the other hand, was requesting 1-MB chunks (128 pages * 8k per page), but we can see from this display that the majority of the I/O operations were 128k-256k in size. This was caused by the md_maxphys value, which is the largest default I/O size that can be performed to an SVM volume.

I/O Access Patterns

When deploying new applications to Solaris systems, it is essential to understand how an application accesses data. Access patterns can be random or sequential and often vary in size. Understanding access patterns can help when tuning direct I/O and caching attributes and are invaluable for finding the optimal block sizes for applications.

Before the introduction of DTrace, an administrator was required to discuss access patterns with vendors and development teams or parse through mounds of truss(1m) and vxtrace(1m) data to estimate how applications access files and volumes. The data was often skewed, required a good deal of time to derive interesting metrics, and was often prone to error.

Now that DTrace is able to capture data throughout the kernel, the job of finding access patterns has been greatly simplified. The DTraceToolkit includes the iopattern and seeksize.d scripts, which can both be used to display per-process and system-wide access patterns. To get system-wide access patterns, the iopattern script can be used:

$ iopattern
%RAN %SEQ  COUNT    MIN    MAX    AVG     KR     KW
  95    5    101    512 131072  30649   2792    231
  60   40    174    512 131072  61557    951   9509
  58   42    187    512 131072  35369   2232   4227
  53   47    229    512 131072  65880   3992  10741
  42   58    170    512 131072  89798   5328   9580
  67   33    190    512 131072  52135   3864   5809
  80   20    195    512 131072   6516   1207     34
  84   16    224    512  24576   7860   1574    145
  89   11    199   1024   8192   7682   1481     12

Iopattern provides the percentage of random and sequential I/O, the number and total size of the I/O operations performed during this sample period, and it provides the minimum, maximum, and average I/O sizes. To get the I/O distribution for each process, the seeksize.d script can be used:

$ seeksize.d
     Sampling... Hit Ctrl-C to end.
     ^C

    PID  CMD
    7312  dd if=/dev/dsk/c1t1d0s2 of=/dev/null bs=1048576

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1762
               1 |                                         0

    PID  CMD
    377  ora_dbw0_proddb
           value  ------------- Distribution ------------- count    
        -8388608 |                                         0        
        -4194304 |                                         3        
        -2097152 |                                         28       
        -1048576 |                                         44       
         -524288 |                                         95       
         -262144 |                                         102      
         -131072 |                                         166      
          -65536 |@                                        363      
          -32768 |@@                                       585      
          -16384 |@                                        328      
           -8192 |                                         141      
           -4096 |                                         105      
           -2048 |                                         57       
           -1024 |                                         35       
            -512 |                                         34       
            -256 |                                         40       
            -128 |                                         31       
             -64 |                                         35       
             -32 |                                         47       
             -16 |                                         22       
              -8 |                                         6        
              -4 |                                         20       
              -2 |                                         0        
              -1 |                                         0        
               0 |@@@@@@@@@@@@                             4164     
               1 |                                         0        
               2 |@@@                                      926      
               4 |@                                        208      
               8 |@@@@@@                                   2148     
              16 |@                                        440      
              32 |@                                        410      
              64 |@                                        318      
             128 |@                                        225      
             256 |@                                        190      
             512 |                                         148      
            1024 |                                         139      
            2048 |                                         155      
            4096 |                                         168      
            8192 |@                                        183      
           16384 |@                                        377      
           32768 |@@                                       657      
           65536 |@                                        406      
          131072 |@                                        199      
          262144 |                                         151      
          524288 |                                         111      
         1048576 |                                         37       
         2097152 |                                         14       
         4194304 |                                         3        
         8388608 |                                         0

Seeksize.d measures the seek distance between consecutive reads and writes and provides a histogram with the number of sectors traveled between consecutive read or write operations. Applications that read or write data in a sequential fashion will contain a small distribution (e.g., dd in the example above). Applications that read or write data in a random fashion will cause the disk heads to seek between read and write operations, which will cause a wide distribution (e.g., the Oracle database writer in the example above) in the seeksize.d output.

Conclusions

With the continued growth in storage capacity and complexity, it is important to have tools to analyze and debug I/O problems. This article introduced several scripts in the DTraceToolkit that can be used to easily observe I/O behavior in ways that were previously difficult or impossible. Because Solaris 10, DTrace, and the DTraceToolkit are free to download and use, truly understanding your applications is just a few clicks away.

References

DTrace Documentation -- http://www.sun.com/bigadmin/content/dtrace

DTraceToolkit -- http://www.opensolaris.org/os/community/dtrace/dtracetoolkit

Solaris 10 -- http://www.sun.com/download

Acknowledgments

Ryan thanks Brendan Gregg for reviewing this article and writing the DTraceToolkit. Ryan would also like to thank the DTrace developers for their awesome contribution to system and application observation.

Ryan Matteson works as a systems engineer, and specializes in Web technologies, SANs, and the OpenBSD, Linux, and Solaris operating systems. When Ryan isn't busy working, he enjoys playing guitar and maintaining his blog at: daemons.net/~matty. Questions and comments about this article can be addressed to: matty@daemons.net.