Profiling SAN, NAS, and DAS I/O Stacks Using io_profile

Bill Pierce

Performance benchmarks are often regarded as one step above voodoo (see "Benchmarking Systems" by Henry Newman, Sys Admin, April 2003: http://www.samag.com/documents/s=7898/sam0304i/). With so much at stake, it's understandable that vendors want to isolate their products and measure them in the environments in which they will perform best. But storage products don't operate in isolated environments. They are components of I/O stacks that will behave differently when used and tuned in combination with other components.

The systems being benchmarked are also far from isolated or static. Change the tuning parameters of one component and then try to predict how it will influence the characteristics of neighboring components. Add SAN traffic or fill the filesystem to 80% and it's difficult to predict how performance characteristics will change overall. Ultimately, what matters is not the behavior of individual components, but the performance that applications see from an I/O stack.

Benchmarks also tend to result in single numbers, for example, "Under these conditions, the average throughput was X". The word "average" is important here because in averaging over important trends, we may be limiting our understanding of what's happening. In physics and other branches of engineering, we often look at the spectral responses of systems in the frequency domain to better understand phenomena that appear unintuitive at first.

Consider the questions of why the sky is blue and sunsets are orange. The answer is that we are looking at two different ends of the same frequency-dependent scattering phenomenon. It's not until we study the dependency of scattering on frequency that we can form a basis on which we can answer these questions. Blue light scatters more readily than red light; the sky is the color of the light that was scattered; the sunset is the color of the light that is less readily scattered. By understanding the spectral behavior of the system, we understand the behavior of the system more deeply, allowing us to draw a broader range of inferences.

I/O Profiling

With the introduction of a variety of network storage and virtualization options, the path from the application's write call to the hard disk's head has grown varied and complex, with many layers in between receiving data and passing it on. Each of these layers has its own performance characteristics that may depend on data packetization (I/O size). Because each layer packetizes data in its own configuration-dependent way, certain I/O sizes may be processed more efficiently than others. I/Os that fit cleanly within a layer's block size are probably transferred with less overhead than those that must be broken apart.
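
To make that concrete, here is a minimal sketch in C (the language io_profile itself is written in) of the block-crossing arithmetic; the 4-KB block size is purely illustrative:

/* How many blocks must a layer with a fixed block size touch to service
 * one I/O?  Assumes the I/O starts on a block boundary; unaligned I/Os
 * can touch one block more. */
#include <stdio.h>

int main(void)
{
    const long block_size = 4096;    /* hypothetical layer block size */
    const long io_sizes[] = { 2048, 4096, 6144, 8192 };

    for (int i = 0; i < 4; i++) {
        long blocks = (io_sizes[i] + block_size - 1) / block_size;
        printf("a %4ld-byte I/O touches %ld block(s)\n", io_sizes[i], blocks);
    }
    return 0;
}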

Likewise, an application may have its own spectral profile at which it makes reads and writes. For example, a database application may page data in 4-KB chunks, and/or may write small records 512 bytes in size.

Other applications, like file servers that host workgroups, may not have a characteristic size but may perform I/O across the spectrum. This possibility begs for a spectral analysis in which throughput is measured as a function of I/O size. Knowing this "I/O Profile", combined with some knowledge of the spectral profile of the application, allows conclusions to be drawn about how well a given I/O stack is tuned for that application's needs. A spectrum of data characterizes the complexity of an I/O stack much more fully than a point measurement at a given I/O size.

In some cases, the act of measuring the I/O profile may turn up some surprising results. For example, if you were asked "Which is a higher performance solution: RAID5 striped across 5 disks, or just a plain disk?", you'd probably assume that the stripe was faster. As the following results will show, this may not always be the case. So, not only was the question too simple, but the answer was wrong and you might have wasted $50K on a new RAID controller that did not provide the performance boost you expected. Too often, we rely on rules of thumb that are fine in isolation but are nullified as other, more complex interactions weigh in. Rather than trying to predict a complex system's behavior and being wrong, why not make it easy to measure and be right?

The io_profile utility that I will describe in this article is a tool for architecting, tuning, and monitoring the spectral performance of the I/O stacks of complex storage systems. Through measurement, we can take performance optimization from primitive voodoo to science.

How Does io_profile Differ from Existing Benchmarks?

Performance analysis and benchmarking tools have been around for some time, and there are a number of standard benchmarks available (e.g., bonnie, bonnie++, iometer, iozone, tiobench, hdparm, postmark). See http://www.acnc.com/benchmarks.html. Of these, only iozone incorporates concepts of spectral profiling. It is a very capable tool that can measure throughput as a surface function of file size and I/O size. Like most flexible tools, it is also a bit difficult to learn to use. As of version 3.152, it was only available for Unix platforms.

Some of these benchmarks use kernel counters to measure throughput. While it can be good to know how the operating system is measuring its own performance, what you most likely care about is what your applications are seeing. A key example of this is in measuring I/O operations per second. Just what is considered an I/O operation? How many bytes are there in one? Different operating systems use different metrics and the answers depend on what level of the I/O stack you are considering. Different layers take these I/O requests and either combine them to form larger request sizes or break them down and packetize them into smaller requests. The io_profile utility defines an I/O operation as a file open, a read or write of a certain number of bytes, and a file close. It works the same way regardless of operating system or storage device.
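
In C-like terms, one such I/O operation looks roughly like the following sketch (the function and variable names are illustrative, not taken from the io_profile source):

/* One I/O operation as io_profile defines it: open the file, read (or
 * write) io_size bytes at some offset, close the file.  Returns 0 only
 * if the full io_size bytes were transferred. */
#include <stdio.h>
#include <stdlib.h>

static int one_read_operation(const char *path, long offset, size_t io_size)
{
    char *buf = malloc(io_size);
    if (buf == NULL)
        return -1;

    FILE *fp = fopen(path, "rb");           /* open  */
    if (fp == NULL) {
        free(buf);
        return -1;
    }
    fseek(fp, offset, SEEK_SET);
    size_t n = fread(buf, 1, io_size, fp);  /* read  */
    fclose(fp);                             /* close */
    free(buf);

    return (n == io_size) ? 0 : -1;
}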

io_profile 1.0 has the following design goals:

  • It is easy to use.
  • It is easy to understand what it measures.
  • It works the same way on Windows, Solaris, and Linux systems.
  • It is a command-line utility, with easily parsed output.
  • It measures performance of the whole I/O stack from the application's perspective, not the kernel's perspective, and not a device's perspective.
  • It can be used on any volume with a filesystem, including NFS/CIFS mounts, USB drives, removable media, locally attached or SAN disks, managed volumes, even USB memory sticks.
  • It can be used to measure performance with and without the influence of the operating system's buffer cache.
  • It can be used to test the performance of a filesystem by operating on a single file or on a set of files organized in a filesystem tree.
  • It is open source and distributed under the GNU General Public License.

Building and Installing io_profile

You can download io_profile from: http://www.teracloud.com/utilities.html. The io_profile utility is written in portable C and can be built on Windows, Solaris, and Linux with make and the most typical C compiler for each of these operating systems. If you don't like building your own executable, you can use a pre-built one in the bin directory of the distribution. Simply copy the binary where you want it and run it. The only dependencies are on the standard C and system libraries of each platform, so no additional DLLs or shared libraries are required. See the README file in the distribution for details.

Using io_profile to Measure Profiles

The io_profile utility was designed to be easy to use and simple in what it does. It has just a few command-line options and uses defaults if options are not specified. For example:

[bill@linus sda3]$ io_profile
IO profile test parameters
--------------------------
Date: Sun Mar  7 17:12:11 2004
Location: linus:/mnt/sda3
File mode: Multi-file filesystem, 3 levels, 4 files per level
Synch mode: Using libc with no buffering
File size: 4194304 bytes
Reads (Writes) from 2 to 65536 bytes
Repetitions per read (write) size: 1000

Pre-initializing the IO channel...
Done with Pre-initialization

Conducting write test...
I/O size [bytes], operations, time [s], throughput [bytes/s],I/O rate [1/s]
2,1000,0.078000,25641.025641,12820.512821
4,1000,0.176000,22727.272727,5681.818182
8,1000,0.075000,106666.666667,13333.333333
16,1000,0.073000,219178.082192,13698.630137
32,1000,0.072000,444444.444444,13888.888889
64,1000,0.073000,876712.328767,13698.630137
...
65536,983,1.037000,62123324.975892,947.926712
Done with write test

Conducting read test...
I/O size [bytes], operations, time [s], throughput [bytes/s], I/O rate [1/s]
2,1000,0.061000,32786.885246,16393.442623
4,1000,0.061000,65573.770492,16393.442623
8,1000,0.061000,131147.540984,16393.442623
16,1000,0.065000,246153.846154,15384.615385
32,1000,0.061000,524590.163934,16393.442623
64,1000,0.061000,1049180.327869,16393.442623
...
65536,985,0.864000,74714074.074074,1140.046296
Done with read test

After echoing the basic test parameters, io_profile checks for sufficient space and creates the file(s) it will use to conduct the tests. Then it performs some I/O to pre-initialize the I/O stack to try to reach a steady state before the measurements begin. io_profile writes test results and errors to STDOUT and STDERR, which you can redirect to a file for later analysis. The comma-separated format of the data is easy to import into your favorite spreadsheet or plotting tool. The output to STDOUT is five columns of data showing the I/O size, number of operations (reads or writes) performed, the time it took to perform those operations, the computed throughput, and I/O rate. Plotting I/O rate on a linear scale vs. I/O size on a logarithmic scale seems to be the best way to compare different profiles.
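
The last two columns follow directly from the first three. The arithmetic below is implied by the output rather than lifted from the io_profile source, but you can check it against any row (for example, 1000 two-byte writes in 0.078 s give 25641 bytes/s and 12820 operations/s):

/* Derived columns: throughput is bytes moved per second, I/O rate is
 * operations completed per second. */
double throughput(long io_size, long operations, double seconds)
{
    return (double)io_size * (double)operations / seconds;
}

double io_rate(long operations, double seconds)
{
    return (double)operations / seconds;
}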

Here is what io_profile looks like in pseudocode:

check_for_space();
create_test_filesystem();
pre_initialize_io_stack();
do_io(write);
do_io(read);
end;

function do_io(mode) {
    for (io_size = min to max) {
        start_timer();
        for (operations = 1 to max) {
            file = choose_random_file();
            open(file);
            seek_to_random_location(file);
            read_or_write(file, mode, io_size);
            close(file);
        }
        time = stop_timer();
        compute_and_output_result();
    }
}

io_profile has several command-line options that warrant some discussion. With the -s option, you can specify whether io_profile operates in single-file or multi-file mode. io_profile always operates in the current working directory. In multi-file mode, a small filesystem tree that is three levels deep with four files per level is created, and each read or write chooses a file from this tree at random. The default number of files and levels can be changed in the source code by editing the io_profile_config.h file. If you specify single-file mode, one file is created, and all reads and writes are done in it. Control over the file mode allows you to simulate the way your application writes data or to test how different filesystems handle file and directory lookups.
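
A minimal sketch of what that random selection might look like follows; the directory and file names and the fixed-size path buffer are illustrative rather than io_profile's own (the real defaults live in io_profile_config.h):

/* Pick a file at random from a tree that is LEVELS directories deep with
 * FILES_PER_LEVEL files at each level, producing paths such as "d0/d1/f3".
 * Writes at most about a dozen characters, so a 64-byte buffer is ample. */
#include <stdio.h>
#include <stdlib.h>

#define LEVELS          3
#define FILES_PER_LEVEL 4

static void choose_random_file(char path[64])
{
    int depth = rand() % LEVELS;           /* how far down the tree to go */
    int file  = rand() % FILES_PER_LEVEL;  /* which file at that level    */
    int off = 0;

    for (int i = 0; i <= depth; i++)
        off += sprintf(path + off, "d%d/", i);
    sprintf(path + off, "f%d", file);
}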

The -m and -M options specify the domain of the profile scan, from the smallest to the largest I/O size (in bytes). These are given as plain decimal numbers, typically powers of two, since io_profile scans by doubling the I/O size between measurements. The defaults are 2 bytes and 64 KB. Choose a range that is relevant to your application.
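
A minimal standalone sketch of the scan that -m and -M control (an illustration, not io_profile code):

/* The I/O size doubles between measurements, so the default range of
 * 2 bytes to 64 KB yields 16 measurement points. */
#include <stdio.h>

int main(void)
{
    long min_size = 2, max_size = 65536;   /* the documented defaults */

    for (long io_size = min_size; io_size <= max_size; io_size *= 2)
        printf("%ld\n", io_size);
    return 0;
}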

The -n option allows you to specify the number of I/Os performed at each I/O size. This governs how long your profiles will take to run and how much temporal averaging is performed. The default is 1000, and some coarse studies have shown that you need at least this many I/Os on a quiescent system for results to be repeatable.

The -l option specifies the size of the test files. The default size is 4 MB. Because the I/O location within each file is chosen at random, you need a file size that is large enough to comfortably fit your largest read or write. Otherwise, you'll get too many misses as the I/Os run into the end of the file.
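
Assuming offsets are drawn uniformly from the whole file, the expected fraction of misses is roughly the largest I/O size divided by the file size, which is consistent with the sample output above (about 16 of 1000 64-KB operations missed against a 4-MB file). A minimal sketch of that estimate:

/* Expected fraction of random-offset I/Os that run past end of file. */
#include <stdio.h>

int main(void)
{
    double file_size = 4.0 * 1024 * 1024;  /* default 4-MB test file   */
    double io_size   = 64.0 * 1024;        /* largest default I/O size */

    printf("expected miss fraction: %.4f\n", io_size / file_size);  /* 0.0156 */
    return 0;
}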

The number of successful I/Os is shown in the second column of the results. Control over the file size also allows you to test how well your filesystem handles large files, in which storage blocks may be referenced through multiple levels of indirection. The file size and the number of files in the test also control the size of the data space used relative to the amount of physical memory available to the operating system's buffer cache. For random I/O, buffer caches are less effective when this data space is larger than the cache.

The -b option uses OS-specific calls to bypass the buffer cache. It forces the application thread to block until each I/O (each read and write on Windows; writes only on Unix) has propagated to or from the storage device. This should allow you to measure the underlying performance of your I/O stack minus the influence of the operating system's buffer cache, if that is what you truly want to measure. Using this option also more closely simulates the way databases that do their own caching write to disk.
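
On a Unix platform, the kind of call involved looks something like the sketch below. This shows the general mechanism (a synchronous open flag), not necessarily the exact flags io_profile uses; the rough Windows equivalent would be FILE_FLAG_WRITE_THROUGH or FILE_FLAG_NO_BUFFERING passed to CreateFile():

/* Open a file so that each write() blocks until the data has reached the
 * storage device, removing the write-behind benefit of the OS buffer cache. */
#include <fcntl.h>

int open_sync_for_write(const char *path)
{
    return open(path, O_WRONLY | O_SYNC);
}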

A few warnings about using io_profile -- like most benchmarks, it will put a load on the system and the I/O stack (including the SAN) on which it is run. Thus, it is best to run it on a non-production system or at a time in the production cycle when the loading will not harm production performance. Also, because io_profile measures elapsed wall-clock time, any load on the system other than io_profile will affect the results. Finally, note that io_profile does not clean up the files it creates for the test; you can simply delete them after the test is complete.

Some Results from io_profile Testing

In this section, I'll examine and compare some I/O profiles measured with io_profile. The purpose is to point out some interesting results that were obtained with the tool, not to fully explain them.

SAN/NAS/DAS Comparison

When provisioning storage for a new application, a variety of options are open to us. Primary among these is the decision of whether to use a locally attached disk (DAS), SAN network storage, or a network-mounted filesystem (NAS). There might be a variety of reasons (availability, security, cost) for choosing one option over another, but let's examine the relative I/O profiles of each of these on the same machine. For these tests, I configured a Red Hat 8.0 Linux host to have five different storage options:

  • A RAID5 LUN on the SAN -- Accessed through an Emulex LP8000 Fibre Channel host bus adapter and a single Brocade Fibre Channel switch to a small Infortrend RAID array exporting a LUN that was a RAID5 stripe (with a 128-KB stripe size) across four SCSI drives.
  • An NFS-mounted filesystem -- Mounted synchronously across our (10/100) LAN from a SUSE 9.1 Linux host configured as an NFS server.
  • A Logical Volume Manager (LVM) logical volume on a single, direct-attached disk -- The disk was a Maxtor 86480D6 IDE drive.
  • A software RAID0 (md) stripe across two IDE disk partitions on two different disks -- Each disk was an older 500MB Western Digital AC2540H IDE drive with two 250-MB partitions used as subcomponents of a stripe and a mirror. The stripe had a chunk size of 64 KB.
  • A software RAID (md) mirror across two IDE disk partitions -- Same disks and partition sizes as above, but configured as a RAID1 mirror.

Each filesystem had a block size of 4 KB and was created on a logical device roughly 400 MB in size. The host had 256 MB of physical memory; by the end of the tests, 219 MB of that was in use, with 140 MB in cache. Tests were run with (Figure 1) and without (Figure 2) operating system buffer cache effects. The io_profile command used all the default options, with -b -m 4096 added for the tests that bypassed the buffer cache.

Only write performance results are shown in Figures 1 and 2 because of the omnipresence of OS buffer cache effects for reads under Linux. The first item to notice is the similar performance of all four block-storage options when buffer cache is used. The cache has a way of equalizing the performance of storage options when the data width fits neatly into it. Still, the SAN LUN seems to have higher performance than the other solutions, particularly at smaller I/O size, with somewhat poorer performance at larger I/O size.

Note the bimodal nature of the SAN trace, as if each measurement belongs to either an upper or a lower curve. Further trials indicated that this was probably what was happening, because the switching between curves was not reproducible. This was most likely due to a cache in the I/O stack behaving like a write-back cache at some times and like a write-through cache while it was flushing.

During the test, activity on the LEDs on the SAN switch and the array's disk drives was bursty even though the I/O requests driving the activity (io_profile) were continuous. Tests conducted while flushing was occurring appear to be on the lower curve. When I averaged over this cache flushing behavior by increasing the repetitions (-n option) to 10K, a curve between these bimodal curves resulted. Interestingly, this averaged curve has somewhat lower performance than the direct-attached solutions. Notice that the NFS solution had nearly constant performance until the I/O size reached 4 KB, above the size of an Ethernet frame (~1500 bytes), but well below the maximum size of an IP packet (64 KB).

With the effects of the operating system's buffer cache effectively removed (-b option), we can compare the relative performance of the underlying storage solutions alone in Figure 2. Here, we see significant performance superiority of the SAN solution over the others, probably due to write caching performed by the array controller. Again, the array's drives would activate in bursts. Next, we see that the single Maxtor drive appeared to be faster than the older and smaller Western Digital drives, even when they were striped.

According to the manufacturers' specs, the Maxtor drive had an average seek time of 9.7 ms and a maximum throughput of 33 MB/s, and the Western Digital drives had an average seek time of 11 ms and a maximum throughput of 13.3 MB/s. In this test, io_profile measured a throughput of roughly 3.2 MB/s at the high end of the spectrum. Also notice that the stripe outperforms the mirror at smaller I/O sizes but not at the larger end of the spectrum. NFS holds its own at the lower end of the spectrum but drops off more rapidly than the block disk options at the higher end.

These results show not only the magnitude and nature of the influence of the buffer cache on the kind of performance that applications will see, but also the spectral differences in the different options. This is just the kind of information you need when you are architecting a storage solution and weighing the pros and cons of different solutions. As we saw for the SAN solution, interactions between OS and device caches may result in lower overall performance even with higher back-end performance. Rather than trying to predict these kinds of (sometimes counterintuitive) results, why not simply measure and find out?

Block Size Tuning

Consider the effect of filesystem block size. This is the fundamental size at which the filesystem allocates blocks of storage on which it stores file data. In theory, large block sizes should be better suited to large, sequentially accessed files, but how large is the effect really? Figure 3 shows two profiles on a 400-MB partition of a SCSI drive under Linux using ext2 filesystems with block sizes of 1024 bytes and 4096 bytes. Measurements were taken on 10-MB files using bypass mode. The data suggest there is little difference in performance at smaller I/O sizes, but roughly a 15% increase is possible for larger writes. It is more difficult to draw conclusions about reads, since it was not possible to isolate the effect of the buffer cache; however, the results suggested that much more significant gains on reads were possible.

Effect of Filesystem Fragmentation

Next, let's examine the effect of filesystem fragmentation on performance. It is well known that performance degrades as a filesystem becomes fuller or more fragmented. Let's monitor this degradation over time with io_profile. I created a 400-MB partition on an IDE drive and formatted that partition with NTFS, under Windows 2000 (SP2), with a 1-KB cluster size. Then I induced fragmentation on that volume by running a script that copied and deleted directory trees containing files of random length between 512 bytes and 2 MB in a round-robin fashion. This caused the filesystem to repeatedly allocate and de-allocate clusters for the files. Fragmentation was monitored using Windows 2000's native defragmentation tool. Writes in bypass mode with a small filesystem are shown, although the read results were similar.
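
The original fragmentation load was a script; the C sketch below captures the same idea under simplifying assumptions (flat files instead of directory trees, illustrative names and counts), alternately creating and deleting batches of files with random lengths between 512 bytes and 2 MB so the filesystem keeps allocating and freeing clusters:

/* Round-robin fragmentation load: create a batch of random-length files,
 * then delete the previous batch, and repeat. */
#include <stdio.h>
#include <stdlib.h>

#define FILES_PER_BATCH 20
#define PASSES          1000

int main(void)
{
    char name[64];
    char buf[4096] = { 0 };

    for (int pass = 0; pass < PASSES; pass++) {
        for (int i = 0; i < FILES_PER_BATCH; i++) {
            long length = 512 + rand() % (2 * 1024 * 1024 - 512);
            sprintf(name, "frag_%d_%d", pass % 2, i);
            FILE *fp = fopen(name, "wb");
            if (fp == NULL)
                return 1;
            /* Lengths are rounded up to the 4-KB buffer size for brevity. */
            for (long written = 0; written < length; written += sizeof buf)
                fwrite(buf, 1, sizeof buf, fp);
            fclose(fp);
        }
        for (int i = 0; i < FILES_PER_BATCH; i++) {
            sprintf(name, "frag_%d_%d", (pass + 1) % 2, i);
            remove(name);  /* harmlessly fails on the very first pass */
        }
    }
    return 0;
}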

The results are shown in Figure 4. The first test shows the I/O profile when the filesystem was empty. The second and third show the profile when file fragmentation had reached 62% and 91%. As expected, performance decreased significantly relative to the empty filesystem but did not appear to be a strong function of fragmentation at higher fragmentation levels. The decrease in performance was greater at larger I/O sizes, which would be expected since the filesystem must work harder to place contiguous blocks of data on the disk. At this point, I used the defragmentation tool to defragment the filesystem, which reduced file fragmentation to 51%. There was slightly better performance after defragmentation, especially at lower I/O sizes.

Finally, I temporarily copied the filesystem's content to another volume, reformatted the volume, copied the content back, and re-ran the test. There was a significant performance increase back toward empty filesystem results, particularly at higher I/O sizes. Apparently, for large sequential I/O, it is better (from a performance standpoint) to back up, reformat, and restore your data than to rely on a defragmentation tool to restore performance to a fragmented filesystem.

Conclusion

The more I/O profiles I measured, the more surprises and interesting behaviors seemed to turn up. It was possible to verify some of my own assumptions and get a feel for the magnitude of performance differences between configurations, and being able to see these differences as a function of I/O size provided additional insights.

I've pointed out some interesting observations of my own, but don't take my word for it. Download the io_profile utility and try it on a few of your I/O stacks. Run it often for feedback as you experiment with different configurations and tuning parameters. Rather than trying to predict cause and effect and later find out your assumptions may have been wrong -- or at least oversimplified -- why not measure and find out the real answer? io_profile is an open source project, so if you'd like to contribute ideas, code, or testing environments, please contact me.

Author of the fcping utility (http://www.teracloud.com/utilities.html), Bill Pierce is another physicist turned software engineer. His goal is to develop storage administration tools that he would want to use as a sys admin. Bill started developing SAN management software in 1998 at Vixel Corp. and is currently a Senior Software Engineer at TeraCloud Corp. Bill can be contacted at: systems_r_up@yahoo.com.