Profiling
SAN, NAS, and DAS I/O Stacks Using io_profile
Bill Pierce
Performance benchmarks are often regarded as one step above voodoo
(see "Benchmarking Systems" by Henry Newman, Sys Admin, April
2003: http://www.samag.com/documents/s=7898/sam0304i/). With
so much at stake, it's understandable that vendors want to isolate
their products in environments and measure them in situations in
which they will do best. But storage products don't operate in isolated
environments. They are components of I/O stacks that will behave
differently when used and tuned in combination with other components.
The systems being benchmarked are also far from isolated or static.
Change the tuning parameters of one component and then try to predict
how it will influence the characteristics of neighboring components.
Add SAN traffic or fill the filesystem to 80% and it's difficult
to predict how performance characteristics will change overall.
Ultimately, what matters is not the behavior of individual components,
but the performance that applications see from an I/O stack.
Benchmarks also tend to result in single numbers, for example,
"Under these conditions, the average throughput was X". The word
"average" is important here because in taking averages over important
trends, we may be limiting our understanding of what's happening.
In physics and other branches of engineering, we often look at the
spectral responses of systems in the frequency domain to better
understand phenomena that appear unintuitive at first.
Consider the questions of why the sky is blue and sunsets are
orange. The answer is that we are looking at two different ends
of the same frequency-dependent scattering phenomenon. It's not
until we study how scattering depends on frequency that we have
a basis for answering these questions. Blue light
scatters more readily than red light; the sky is the color of the
light that was scattered; the sunset is the color of the light that
is less readily scattered. By understanding the spectral behavior
of the system, we understand the behavior of the system more deeply,
allowing us to draw a broader range of inferences.
I/O Profiling
With the introduction of a variety of network storage and virtualization
options, the path from the application's write call to the hard
disk's head has grown varied and complex, with many layers in between
receiving data and passing it on. Each of these layers has its own
performance characteristics that may be dependent on data packetization
(I/O size). Because each layer packetizes data in its own configuration-dependent
way, certain packet sizes may be more efficiently processed. I/Os
that fit cleanly within a layer's block size are probably transferred
with less overhead than those that must be broken apart.
Likewise, an application may have its own spectral profile of the
I/O sizes at which it makes reads and writes. For example, a database
application may page data in 4-KB chunks and write small records
512 bytes in size.
Other applications, like file servers that host workgroups, may
not have a characteristic size but may perform I/O across the spectrum.
This possibility begs for a spectral analysis in which throughput
is measured as a function of I/O size. Knowing this "I/O Profile",
combined with some knowledge of the spectral profile of the application,
allows conclusions to be drawn about how well a given I/O stack
is tuned for that application's needs. A spectrum of data characterizes
the complexity of an I/O stack much more fully than a point measurement
at a given I/O size.
In some cases, the act of measuring the I/O profile may turn up
some surprising results. For example, if you were asked "Which is
a higher performance solution: RAID5 striped across 5 disks, or
just a plain disk?", you'd probably assume that the stripe was faster.
As the following results will show, this may not always be the case.
So, not only was the question too simple, but the answer was wrong
and you might have wasted $50K on a new RAID controller that did
not provide the performance boost you expected. Too often, we rely
on rules of thumb that are fine in isolation but are nullified as
other, more complex interactions weigh in. Rather than trying to
predict a complex system's behavior and being wrong, why not make
it easy to measure and be right?
The io_profile utility that I will describe in this article is
a tool for architecting, tuning, and monitoring the spectral performance
of the I/O stacks of complex storage systems. Through measurement,
we can take performance optimization from primitive voodoo to science.
How Does io_profile Differ from Existing Benchmarks?
Performance analysis and benchmarking tools have been around for
some time, and there are a number of standard benchmarks available
(e.g., bonnie, bonnie++, iometer, iozone, tiobench, hdparm, postmark).
See http://www.acnc.com/benchmarks.html. Of these, only iozone
incorporates concepts of spectral profiling. It is a very capable
tool that can measure throughput as a surface function of file size
and I/O size. Like most flexible tools, it is also a bit difficult
to learn to use. As of version 3.152, it was only available for
Unix platforms.
Some of these benchmarks use kernel counters to measure throughput.
While it can be good to know how the operating system is measuring
its own performance, what you most likely care about is what your
applications are seeing. A key example of this is in measuring I/O
operations per second. Just what is considered an I/O operation?
How many bytes are there in one? Different operating systems use
different metrics and the answers depend on what level of the I/O
stack you are considering. Different layers take these I/O requests
and either combine them to form larger request sizes or break them
down and packetize them into smaller requests. The io_profile utility
defines an I/O operation as a file open, a read or write of a certain
number of bytes, and a file close. It works the same way regardless
of operating system or storage device.
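To make this definition concrete, here is a minimal sketch in C of a single read operation as io_profile counts it. This is illustrative only, not the actual io_profile source; the POSIX calls shown are simply one way to express the open/read/close sequence (the Windows build would use the equivalent Win32 calls):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* One "I/O operation" as io_profile defines it: open the file, read
   (or write) io_size bytes at the given offset, and close the file.
   Returns the number of bytes actually read, or -1 on error. */
static ssize_t one_read_operation(const char *path, off_t offset,
                                  char *buf, size_t io_size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, buf, io_size, offset);
    close(fd);
    return n;
}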
io_profile 1.0 has the following design goals:
- It is easy to use.
- It is easy to understand what it measures.
- It works the same way on Windows, Solaris, and Linux systems.
- It is a command-line utility, with easily parsed output.
- It measures performance of the whole I/O stack from the application's
perspective, not the kernel's perspective, and not a device's
perspective.
- It can be used on any volume with a filesystem, including NFS/CIFS
mounts, USB drives, removable media, locally attached or SAN disks,
managed volumes, even USB memory sticks.
- It can be used to measure performance with and without the
influence of the operating system's buffer cache.
- It can be used to test the performance of a filesystem by operating
on a single file or on a set of files organized in a filesystem
tree.
- It is open source and distributed under the GNU General Public License.
Building and Installing io_profile
You can download io_profile from: http://www.teracloud.com/utilities.html.
The io_profile utility is written in portable C and can be built
on Windows, Solaris, and Linux with make and the standard
C compiler for each of these operating systems. If you don't like
building your own executable, you can use a pre-built one in the
bin directory of the distribution. Simply copy the binary where
you want it and run it. The only dependencies are on the standard
C and system libraries of each platform, so no additional DLLs or
shared libraries are required. See the README file in the distribution
for details.
Using io_profile to Measure Profiles
The io_profile utility was designed to be easy to use and simple
in what it does. It has just a few command-line options and uses
defaults if options are not specified. For example:
[bill@linus sda3]$ io_profile
IO profile test parameters
--------------------------
Date: Sun Mar 7 17:12:11 2004
Location: linus:/mnt/sda3
File mode: Multi-file filesystem, 3 levels, 4 files per level
Synch mode: Using libc with no buffering
File size: 4194304 bytes
Reads (Writes) from 2 to 65536 bytes
Repetitions per read (write) size: 1000
Pre-initializing the IO channel...
Done with Pre-initialization
Conducting write test...
I/O size [bytes], operations, time [s], throughput [bytes/s], I/O rate [1/s]
2,1000,0.078000,25641.025641,12820.512821
4,1000,0.176000,22727.272727,5681.818182
8,1000,0.075000,106666.666667,13333.333333
16,1000,0.073000,219178.082192,13698.630137
32,1000,0.072000,444444.444444,13888.888889
64,1000,0.073000,876712.328767,13698.630137
...
65536,983,1.037000,62123324.975892,947.926712
Done with write test
Conducting read test...
I/O size [bytes], operations, time [s], throughput [bytes/s], I/O rate [1/s]
2,1000,0.061000,32786.885246,16393.442623
4,1000,0.061000,65573.770492,16393.442623
8,1000,0.061000,131147.540984,16393.442623
16,1000,0.065000,246153.846154,15384.615385
32,1000,0.061000,524590.163934,16393.442623
64,1000,0.061000,1049180.327869,16393.442623
...
65536,985,0.864000,74714074.074074,1140.046296
Done with read test
After echoing the basic test parameters, io_profile checks for sufficient
space and creates the file(s) it will use to conduct the tests. Then
it performs some I/O to pre-initialize the I/O stack to try to reach
a steady state before the measurements begin. io_profile writes test
results and errors to STDOUT and STDERR, which you can redirect to
a file for later analysis. The comma-separated format of the data
is easy to import into your favorite spreadsheet or plotting tool.
The output to STDOUT is five columns of data showing the I/O size,
number of operations (reads or writes) performed, the time it took
to perform those operations, the computed throughput, and I/O rate.
Plotting I/O rate on a linear scale vs. I/O size on a logarithmic
scale seems to be the best way to compare different profiles.
Here is what io_profile looks like in pseudocode:
check_for_space();
create_test_filesystem();
pre_initialize_io_stack();
do_io(write);
do_io(read);

function do_io(mode) {
    for (io_size = min to max, doubling) {
        start_timer();
        for (operations = 1 to max) {
            file = choose_random_file();
            open(file);
            seek_to_random_location(file);
            read_or_write(file, mode, io_size);
            close(file);
        }
        time = stop_timer();
        compute_and_output_result();
    }
}
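Fleshed out a little, the timed inner loop might look like the following POSIX C sketch. It is a simplified, single-file, write-only version with illustrative names, not the actual io_profile source; the real tool also builds on Windows, supports multi-file mode, and accounts for I/Os that run off the end of the file:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Measure write throughput at one I/O size against one test file. */
static void measure_write_size(const char *path, off_t file_size,
                               size_t io_size, int repetitions)
{
    char *buf = malloc(io_size);
    struct timeval t0, t1;

    memset(buf, 'x', io_size);
    gettimeofday(&t0, NULL);
    for (int i = 0; i < repetitions; i++) {
        int fd = open(path, O_WRONLY);  /* error handling omitted for brevity */
        /* Pick a random offset that keeps the whole write inside the file;
           io_profile apparently counts I/Os that hit end of file as misses
           instead, which is why its "operations" column can fall below -n
           at large I/O sizes. */
        off_t offset = rand() % (file_size - (off_t)io_size + 1);
        pwrite(fd, buf, io_size, offset);
        close(fd);
    }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%zu,%d,%f,%f,%f\n", io_size, repetitions, secs,
           repetitions * (double)io_size / secs,  /* throughput [bytes/s] */
           repetitions / secs);                   /* I/O rate [1/s] */
    free(buf);
}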
io_profile has several command-line options that warrant some discussion.
With the -s option, you can specify whether you want io_profile
to operate in single- or multi-file mode. io_profile always operates
in the current working directory. In multi-file mode, a small filesystem
that is three levels deep with four files per level will be created.
Each read or write chooses a file from this filesystem at random.
The default number of files and levels can be changed in the source
code by editing the io_profile_config.h file. If you specify single
file mode, one file is created, and all reads and writes are done
in it. Control over file mode allows you to best simulate the way
your application writes data or test how different filesystems handle
file and directory lookups.
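For illustration, choosing a file at random from such a tree might look like the following C sketch. The directory and file names here are made up; the real layout and defaults are defined in io_profile_config.h:

#include <stdio.h>
#include <stdlib.h>

#define LEVELS          3
#define FILES_PER_LEVEL 4

/* Writes a relative path such as "dir0/dir1/file2.dat" into path[],
   which is assumed to be large enough. */
static void choose_random_file(char *path)
{
    int level = rand() % LEVELS;           /* how deep in the tree     */
    int file  = rand() % FILES_PER_LEVEL;  /* which file at that level */
    int n = 0;

    for (int d = 0; d <= level; d++)
        n += sprintf(path + n, "dir%d/", d);
    sprintf(path + n, "file%d.dat", file);
}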
The -m and -M options specify the domain of the
profile scan from the smallest to the largest I/O size (in bytes).
These can be decimal representations of powers of two (for example,
4096), as io_profile scans by doubling the I/O size between each
measurement. The defaults are 2 bytes and 64 KB. Choose a range that is relevant to
your application.
The -n option allows you to specify the number of I/Os
performed at each I/O size. This governs how long your profiles
will take to run and how much temporal averaging is performed. The
default is 1000, and some coarse studies have shown that you need
at least this many I/Os on a quiescent system for results to be
repeatable.
The -l option allows you to specify the size of the test
files that are used for the test. The default size is 4 MB. Because
the I/O location within the file is chosen at random, you need a
file size that is large enough to comfortably fit your largest read/write.
Otherwise, you'll get too many misses as the I/Os hit the end of
file.
The number of successful I/Os is shown in the second column of
the results. Control over the file size also allows you to test
how well your filesystem handles large files in which storage blocks
may be indirectly referenced multiple times. The file size, together
with the number of files in the test, also controls the width of the data space that
is used relative to the amount of physical memory available to the
operating system's buffer cache. For random I/O, buffer caches are
less effective when this data space is larger than the cache.
The -b option uses OS-specific calls to bypass the buffer
cache. It forces the application thread to block until each I/O
(every I/O on Windows; writes only on Unix) has propagated to or from the storage
device. This should allow you to measure the underlying performance
of your I/O stack minus the influence of the operating system's
buffer cache, if that is what you truly want to measure. Using this
option would also more closely simulate the way databases that do
their own caching would write to disk.
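For reference, the kind of OS-specific call involved looks roughly like the sketch below. These are the common mechanisms for requesting unbuffered or write-through I/O on each platform; the exact flags io_profile uses may differ:

#ifdef _WIN32
#include <windows.h>
/* On Windows, unbuffered, write-through I/O is requested when the file
   is opened; FILE_FLAG_NO_BUFFERING also requires sector-aligned
   buffers and transfer sizes. */
HANDLE open_unbuffered(const char *path)
{
    return CreateFileA(path, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                       OPEN_EXISTING,
                       FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                       NULL);
}
#else
#include <fcntl.h>
/* On Unix, O_SYNC makes each write block until the data has reached
   the device; reads may still be satisfied from the page cache, which
   matches the "writes only on Unix" caveat above. */
int open_unbuffered(const char *path)
{
    return open(path, O_RDWR | O_SYNC);
}
#endif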
A few warnings about using io_profile -- like most benchmarks,
it will put a load on the system and I/O stack (including the SAN)
on which it is run. Thus, it would be best to run it on a non-production
system or at a time in the production cycle when the loading will
not harm production performance. In fact, because io_profile uses
wall-clock time for its measurements, loads on the system other than
io_profile will affect the results. Also note that io_profile does
not clean up the files it creates for the test. You can simply delete
them after the test is complete.
Some Results from io_profile Testing
In this section, I'll examine and compare some I/O profiles measured
with io_profile. The purpose is to point out some interesting results
that were obtained with the tool, not to fully explain them.
SAN/NAS/DAS Comparison
When provisioning storage for a new application, a variety of
options are open to us. Primary among these is the decision of whether
to use a locally attached disk (DAS), SAN network storage, or a
network-mounted filesystem (NAS). There might be a variety of reasons
(availability, security, cost) for choosing one option over another,
but let's examine the relative I/O profiles of each of these on
the same machine. For these tests, I configured a Red Hat 8.0 Linux
host to have five different storage options:
- A RAID5 LUN on the SAN -- Through an Emulex LP8000 Fibre Channel
Host Bus Adaptor and a single Brocade Fibre Channel switch
to a small Infortrend RAID array exporting a LUN that was a RAID5
stripe (with a stripe size of 128 KB) across four SCSI drives.
- An NFS-mounted filesystem -- Mounted synchronously across our
(10/100) LAN from a SUSE 9.1 Linux host configured as an NFS server.
- An LVM (Logical Volume Manager) logical volume on a single, direct-attached
disk -- The disk was a Maxtor 86480D6 IDE drive.
- A software RAID0 (md) stripe across two IDE disk partitions
on two different disks -- Each disk was an older 500MB Western
Digital AC2540H IDE drive with two 250-MB partitions used as subcomponents
of a stripe and a mirror. The stripe had a chunk size of 64 KB.
- A software RAID (md) mirror across two IDE disk partitions
-- Same disks and partition sizes as above, but configured as
a RAID1 mirror.
Each filesystem had a block size of 4 KB and was created on a
logical device roughly 400 MB in size. The host had 256 MB of physical
memory and by the end of the tests, 219 MB of that was used with
140 MB in cache. Tests were run with (Figure 1) and without (Figure
2) operating system buffer cache effects. The io_profile
command that was used for the tests used all the default options,
with -b -m 4096 used for the tests bypassing the buffer cache.
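In other words, each bypass run amounted to changing into the mount point of the volume under test and running something like this (the output file name is just an example):

io_profile -b -m 4096 > san_raid5_bypass.out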
Only write performance results are shown in Figures 1 and 2 because
of the omnipresence of OS buffer cache effects for reads under Linux.
The first item to notice is the similar performance of all four
block-storage options when buffer cache is used. The cache has a
way of equalizing the performance of storage options when the data
width fits neatly into it. Still, the SAN LUN seems to have higher
performance than the other solutions, particularly at smaller I/O
sizes, with somewhat poorer performance at larger I/O sizes.
Note the bimodal nature of the SAN trace, as if each measurement
belongs to either an upper or a lower curve. Further trials indicated
that this was probably what was happening, because the switching
between curves was not reproducible. This was most likely due to
a cache in the I/O stack behaving more like a write-back cache at
some times and more like a write-through cache while it flushed.
During the test, activity on the LEDs on the SAN switch and the
array's disk drives was bursty even though the I/O requests driving
the activity (io_profile) were continuous. Tests conducted while
flushing was occurring appear to be on the lower curve. When I averaged
over this cache flushing behavior by increasing the repetitions
(-n option) to 10K, a curve between these bimodal curves
resulted. Interestingly, this averaged curve has somewhat lower
performance than the direct-attached solutions. Notice that the
NFS solution had nearly constant performance until the I/O size
reached 4 KB, above the size of an Ethernet frame (~1500 bytes),
but well below the maximum size of an IP packet (64 KB).
With the effects of the operating system's buffer cache effectively
removed (-b option), we can compare the relative performance
of the underlying storage solutions alone in Figure 2. Here, we see
significant performance superiority of the SAN solution over the
others, probably due to write caching performed by the array controller.
Again, the array's drives would activate in bursts. Next, we see
that the single Maxtor drive appeared to be faster than the older
and smaller Western Digital drives, even when they were striped.
According to the manufacturer's specs on these drives, the Maxtor
drive had an average seek time of 9.7 ms and a maximum throughput
of 33 MB/s and the Western Digital drives had an average seek time
of 11 ms and a maximum throughput of 13.3 MB/s. In this test, io_profile
measured a throughput of roughly 3.2 MB/s at the high
end of the spectrum. Also notice that the stripe outperforms
the mirror at smaller I/O sizes but not at the larger end of
the spectrum. NFS holds its own at the lower end of the spectrum
but drops off more rapidly than the block disk options at the higher
end.
These results show not only the magnitude and nature of the influence
of the buffer cache on the kind of performance that applications
will see, but also the spectral differences in the different options.
This is just the kind of information you need when you are architecting
a storage solution and weighing the pros and cons of different solutions.
As we saw for the SAN solution, interactions between OS and device
caches may result in lower overall performance even with higher
back-end performance. Rather than trying to predict these kinds
of (sometimes counterintuitive) results, why not simply measure
and find out?
Block Size Tuning
Consider the effect of filesystem block size. This is the fundamental
unit in which the filesystem allocates storage for file data.
In theory, larger block sizes should suit large, sequentially
accessed files better, but how large
is the effect really? Figure 3 shows two profiles on a 400-MB partition
of a SCSI drive under Linux using ext2 filesystems with block sizes
of 1024 bytes and 4096 bytes. Measurements were taken on 10-MB files
using bypass mode. The data suggest there is little difference in
performance at smaller I/O sizes, but roughly a 15% increase is
possible for larger writes. It is more difficult to draw conclusions
about reads since it was not possible to isolate the effect of the
buffer cache. However, the results suggested that much more significant
gains on reads were possible.
Effect of Filesystem Fragmentation
Next, let's examine the effect of filesystem fragmentation on
performance. It is well known that performance degrades as a filesystem
becomes more full or fragmented. Let's monitor this degradation
over time with io_profile. I created a 400-MB partition on an IDE
drive and formatted that partition with NTFS, under Windows 2000 (SP2),
with a 1-KB cluster size. Then I created fragmentation on that volume
by running a script that would copy and delete directory trees containing
files of random length between 512 bytes and 2 MB in a round-robin
fashion. This caused the filesystem to repeatedly allocate and de-allocate
clusters for the files. Fragmentation was monitored using Windows
2000's native defragmentation tool. Writes in bypass mode with a
small filesystem are shown, although the read results were similar.
The results are shown in Figure 4. The first test shows the I/O
profile when the filesystem was empty. The second and third show
the profiles when file fragmentation had reached 62% and 91%, respectively. As expected, performance
decreased significantly from the empty filesystem but did not appear
to be a strong function of fragmentation at higher levels of fragmentation.
The decrease in performance was greater at larger I/O sizes, which
would be expected since the filesystem must work harder to try to
place contiguous blocks of data on the disk. At this point, I used
the defragmentation tool to defragment the filesystem, which reduced
file fragmentation to 51%. There was slightly better performance
after defragmentation was employed, especially at lower I/O sizes.
Finally, I temporarily copied the filesystem's content to another
volume, reformatted the volume, copied the content back, and re-ran
the test. There was a significant performance increase back toward
empty filesystem results, particularly at higher I/O sizes. Apparently,
for large sequential I/O, it is better (from a performance standpoint)
to back up, reformat, and restore your data than to rely on a defragmentation
tool to restore performance to a fragmented filesystem.
Conclusion
The more I/O profiles I measured, the more surprises and interesting
behaviors seemed to turn up. It was possible to verify some of my
own assumptions and get a feel for the magnitude of performance
differences between configurations, and being able to see these
differences as a function of I/O size provided additional insights.
I've pointed out some interesting observations of my own, but
don't take my word for it. Download the io_profile utility and try
it on a few of your I/O stacks. Run it often for feedback as you
experiment with different configurations and tuning parameters.
Rather than trying to predict cause and effect and later find out
your assumptions may have been wrong -- or at least oversimplified
-- why not measure and find out the real answer? io_profile is an
open source project, so if you'd like to contribute ideas, code,
or testing environments, please contact me.
Author of the fcping utility (http://www.teracloud.com/utilities.html),
Bill Pierce is another physicist turned software engineer. His goal
is to develop storage administration tools that he would want to
use as a sys admin. Bill started developing SAN management software
in 1998 at Vixel Corp. and is currently a Senior Software Engineer
at TeraCloud Corp. Bill can be contacted at: systems_r_up@yahoo.com. |