Dissecting ATA RAID Options
Bryan J. Smith
Advanced Technology Attachment (ATA) is a basic, block transfer
interface. In recent ATA specifications (e.g., the UltraDMA modes
for parallel ATA or newer Serial ATA), this is little more than
a copy of a fixed range of data from primary storage (volatile memory)
to secondary storage (non-volatile ATA disk), or vice versa, using
direct memory access (DMA). There is minimal system processor involvement
in the transfer because the ATA controller arbitrates the DMA transfer
on the I/O bus (e.g., PCI) and the integrated device electronics
(IDE) of the ATA drive to copy data to and from system memory.
In a nutshell, it's fast, efficient, and direct, using a point-to-point
transfer between two fixed endpoints (assuming there is no "master/slave"
setup in the case of parallel ATA). This is called non-blocking
I/O, and it gives ATA the highest throughput, or data transfer
rate (DTR), for desktops and workstations, with reduced latency
compared with command-queued, multiple-target I/O approaches like SCSI.
Multiple Devices
Things start to get complicated when devices, or targets, are
added to the storage solution, whether for purposes of capacity,
performance, or redundancy. With ATA, adding more devices means
adding more points of attachment on the I/O bus, such as multiple
ATA channels spread over multiple PCI slots or separate buses. The
downside is potential I/O contention between the I/O controllers
(especially when on the same I/O bus). SCSI, on the other hand,
supports multiple target attachments per I/O controller, so the point of contention
shifts to the SCSI bus, away from the I/O bus.
Storage experts are still debating which approach is better --
the traditional SCSI approach of multiple targets per bus, or multiple
point-to-point ATA channels. Fortunately for most high-end systems,
there are ATA storage solutions that offer more than simply two
ATA channels per I/O controller. Some offer on-board processing
and memory for off-loading operations as well as command queuing
and tagging when it is beneficial. These ATA solutions are more
on par with traditional SCSI I/O approaches than merely "dumb" ATA
controllers. Ultimately, as I will explain, it is the combination
of storage array design and the applications' needs that should
define the storage solution chosen.
ATA RAID Approaches
This article assumes you are familiar with the general concept
of Redundant Arrays of Independent Disks (RAID) and Just a Bunch
Of Disks (JBOD). Although there are endless ATA RAID products and
solutions on the market, they all boil down to four distinct approaches
(see Table 1), which I will describe in detail.
Logical Volume Management (LVM) -- The underlying logical volume
management of Windows NT (2000/XP/2003), Linux's LVM, Sun's
Solstice DiskSuite, and other solutions, possibly as an add-on (e.g.,
Veritas or IBM Tivoli products). Depending on the implementation,
the system may still be bootable from the firmware (e.g., PC BIOS)
without additional configuration (e.g., a copy of the OS bootstrap
with the required LVM drivers is on each disk in the array).
Free RAID (FRAID) Controller -- A standard ATA controller with
a 16-bit PC BIOS that adds basic RAID organization options in its
firmware, making the disks look like a single volume at boot-time
(before the OS loads). The disks are still accessed directly and
independently from the standpoint of system I/O. Once a 32/64-bit
OS loads, a driver is required to "organize" the disks into a volume
the OS can understand. This driver is where all RAID functionality
is located, so it is still a software RAID solution, driven by the
OS and, more importantly, the CPU of the system (not the controller).
FRAID is the most commonplace solution, because it is a "free" way
for a PC OEM to add RAID to any ATA controller -- both on the mainboard
and on PCI add-on cards.
Buffering RAID Controller -- The traditional approach used by
SCSI RAID solutions. An on-board microcontroller (µC) controls all
transfers with the system I/O; a µC is like a microprocessor, but it
is typically not super-pipelined (Pentium class or greater) and is
designed for I/O, with integrated peripherals or capabilities. RAID
volumes are addressed as a single (or multiple, as organized in
firmware) block storage device, and there is no means for the system
I/O to directly access the drives. An ample supply of on-board,
[Synchronous] Dynamic RAM ([S]DRAM) provides buffered I/O, allowing
operations to be extensively queued. There is an embedded OS on
the board, which is dedicated to RAID functionality.
Non-Blocking RAID Switch -- An ATA-centric approach used by at
least one major vendor. An on-board, application-specific integrated
circuit (ASIC) controls all transfers with the system I/O, as with a
buffering controller, but it is hardwired for its specific application.
Data passes through the ASIC without delay, other than some basic
command queuing for organization. This non-buffered approach to
I/O requires true 0 wait state memory, such as Static RAM (SRAM).
The trade-off is that an SRAM cell is built from several transistors
of latching logic, not the simple one-transistor-plus-capacitor cell
of DRAM, so it is much larger in silicon area and far more expensive per byte.
Note that Synchronous DRAM (SDRAM) is still a DRAM technology with
40ns+ latencies. The synchronous clock merely limits the impact of
these latencies for writes (e.g., it may appear to the system processor
that the data is committed within a few clock cycles, but it actually
takes dozens of cycles before the DRAM cell is written). For DRAM
reads, there is no way to avoid waiting those dozens of cycles,
resulting in massive latency and ultimately dragging down the average
DTR over time. DRAM latency is typically mitigated in a modern system
by placing a small amount of 0 wait state RAM near the CPU; that is
why SRAM is embedded in the modern processor (L1 and L2 cache) to
keep immediate data readable.
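To make the effect concrete, here is a rough, back-of-the-envelope
sketch; the latency and burst figures are assumed round numbers for
illustration, not measurements of any particular memory part:

# Rough illustration of why access latency drags down average DTR,
# even when the peak (burst) rate is high.  All figures are assumed
# round numbers for the sake of the example.

def average_dtr(burst_mb_per_s, access_latency_ns, burst_bytes):
    """Average rate when every burst pays an up-front access latency."""
    transfer_ns = burst_bytes / (burst_mb_per_s * 1e6) * 1e9
    total_ns = access_latency_ns + transfer_ns
    return burst_bytes / (total_ns * 1e-9) / 1e6   # MB/s

# 64-byte bursts at a 1600 MB/s peak rate:
print("DRAM-like, 40 ns latency: %4.0f MB/s" % average_dtr(1600, 40, 64))
print("0 wait state,  0 ns:      %4.0f MB/s" % average_dtr(1600, 0, 64))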
The difference between LVM and FRAID can be confusing. However,
FRAID is no different from a "regular" ATA controller, except for
some added logic in its on-board 16-bit PC BIOS. The more advanced,
32/64-bit RAID logic required after the OS loads is in the driver,
and executes on the processor, just like an OS LVM implementation.
The data transfer is still point-to-point between the individual
drive and system memory, with the associated overhead (e.g., second
copy of data for RAID-1/mirroring, discussed later) for the RAID
operations on the system's I/O. In some cases, the CPU is also utilized
for FRAID operations (e.g., XOR operations for RAID-5, also discussed
later).
In "intelligent" hardware RAID, on-board controller and RAM can
off-load the burden of RAID overhead from the system's processor
and interconnect. The data transfer occurs between the system memory
and on-board controller, in a single data stream regardless of the
RAID level of the volume. The on-board controller handles the point-to-point
connections to the drives, independent of the rest of the system.
In the case of the µC+DRAM approach, the data writes are always
buffered in the local, on-board RAM first. In the case of the ASIC+SRAM
approach, an "adaptive cut-through" approach (directly to the disks,
if the disks are not busy) can be offered alongside the traditional
"store and forward" buffering approach.
Note that the use of common OSI layer-2/IEEE802 switch product
terminology here is deliberate because Ethernet switches use the
same ASIC+SRAM design as "Storage switches." The only difference
is the use of PHY interface chips for the Ethernet media channel
access instead of ATA controllers for drive channels.
RAID-0
As shown in Table 2, RAID-0 striping can theoretically offer up
to n times (where n equals the number of disks) the data transfer
rate compared with JBOD -- a linear increase in transfer rate as
each disk is added, for both reads and writes. At the same time,
the amount of data actually transferred to/from disk versus data committed
by the OS is 1:1 for both writes and reads, the same as JBOD. So, there is no
overhead disadvantage except for the lack of redundancy, including
destruction of the entire volume if any single disk is lost.
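As a concrete illustration of why the theoretical DTR scales with n,
the sketch below shows how a striped volume spreads consecutive chunks
round-robin across its member disks, so a long sequential transfer keeps
every spindle busy at once. The chunk size and mapping are hypothetical,
not any particular controller's on-disk format:

# Minimal sketch of RAID-0 striping: logical blocks map round-robin across
# n disks in fixed-size chunks.  The chunk size and layout are illustrative.

def raid0_map(logical_block, n_disks, chunk_blocks=64):
    """Return (disk index, physical block on that disk) for a logical block."""
    chunk = logical_block // chunk_blocks       # which chunk of the volume
    offset = logical_block % chunk_blocks       # offset within that chunk
    disk = chunk % n_disks                      # chunks rotate across disks
    physical_chunk = chunk // n_disks           # chunk position on that disk
    return disk, physical_chunk * chunk_blocks + offset

# A sequential read walks every disk in turn, which is why the theoretical
# DTR approaches n times that of a single disk:
for lb in range(0, 5 * 64, 64):
    print("logical block %4d -> disk %d, block %d" % ((lb,) + raid0_map(lb, 4)))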
RAID-0 is ideally suited to software, rather than a single add-on
card solution. The LVM in most OSes and software RAID products offers
expanded volume management over most hardware RAID-0 cards. From a
performance standpoint, using two controllers on two
separate I/O buses and interleaving data between controllers in
a RAID-0 striping volume is best. With little more than a small
amount of command overhead, the additional system processor load
in software RAID-0 is negligible. (Note that separate I/O buses
mean separate PCI buses on a typical PC mainboard, not merely different
PCI slots. Current PC chipsets sporting multiple PCI buses include
AMD's 760MP series (Athlon MP) and Intel/ServerWorks' E7500/ServerSet
series (Xeon-SMP). AMD's new 64-bit platforms with HyperTransport
offer multiple PCI-X bus options and Intel's forthcoming PCI-Express
standard will as well.)
Given that most OS LVM implementations for RAID-0 on a system
volume are completely bootable -- negating the BIOS advantage offered
by a FRAID card -- LVM becomes the better software solution because
the OS knows how to organize its own commits directly to disks,
rather than going through a vendor's driver logic.
RAID-0 is also the first place to dismiss the option of going
with a buffered RAID controller for ATA. The interleaved operation
of RAID-0 is as near to a non-blocking transfer as possible. The
buffering of writes and, especially, reads, is a redundant operation
that the OS is more capable of doing in its own system memory for
ATA storage. Buffered RAID controllers for RAID-0 should be limited
to solutions where dozens of disks are required, and those are better
implemented with a very high-end SCSI solution (e.g., a StrongARM/XScale
microcontroller with three or more Ultra320 channels). For fewer
disks, a non-blocking RAID switch becomes the ideal hardware
approach. Like LVM, it will interleave transfers between its disks
and offer command-queuing and buffering as appropriate (largely
for reads), thereby reducing processor overhead.
RAID-1
As seen in Table 2, RAID-1 mirroring is where write overhead, the cost
of redundancy and fault tolerance, begins to affect software approaches.
There is essentially a 2:1 write overhead; in other words,
twice as much data is committed to disk as is written compared with
JBOD. Reads remain 1:1.
As illustrated in Figure 1, software RAID-1 requires twice as
much data transfer between the memory and I/O for write operations,
effectively tying up the system interconnect for twice as long while
offering no improved performance. This still occurs with the majority
of FRAID cards. To make matters worse, FRAID card performance can
really suffer if a "slave" device is added to a single (parallel)
ATA channel where there is an existing "master." Regardless of approach,
ATA performance requires a single device per channel for point-to-point,
non-blocking I/O. The SerialATA specification smartly avoids offering
even the option of a "slave".
RAID-1 is where intelligent hardware RAID begins to show its benefit.
Unlike software RAID, where two mirrored I/O operations must occur, a
single data transfer occurs between system memory and the RAID controller.
The write overhead of the mirroring is essentially moved to the controller,
so the overhead becomes 1:1 versus JBOD from the standpoint of the system.
Because mirrors are simply two copies of the same data, the on-board
controllers of intelligent ATA RAID solutions are smart enough to
interleave multiple reads from different disks. Thus, they reduce
disk-seek latency times and theoretically increase read DTR up to 2:1
over JBOD. Some LVM implementations, and almost no FRAID cards,
offer this performance enhancement. Like RAID-0, RAID-1 is also
as near to a non-blocking I/O operation as possible -- especially
for interleaved reads.
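The read-interleaving idea can be sketched as follows; the shortest-queue
policy shown is just one plausible heuristic a controller or LVM might
use, not a description of any particular product's firmware:

# Sketch of RAID-1 read interleaving: with two identical copies of the data,
# each read can go to whichever mirror is least busy, roughly doubling
# theoretical read DTR.  Writes must still go to every copy.

class Raid1Mirror:
    def __init__(self, n_copies=2):
        self.queues = [0] * n_copies            # outstanding requests per disk

    def submit_read(self, block):
        # Pick the mirror with the shortest queue (one plausible heuristic).
        disk = min(range(len(self.queues)), key=lambda d: self.queues[d])
        self.queues[disk] += 1
        return disk

    def submit_write(self, block):
        # Every copy is written -- the 2:1 write overhead of mirroring.
        for d in range(len(self.queues)):
            self.queues[d] += 1
        return list(range(len(self.queues)))

mirror = Raid1Mirror()
print([mirror.submit_read(b) for b in range(6)])    # reads alternate 0,1,0,1,...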
RAID-5
RAID-5 striping with striped parity is where the write efficiency
and overhead quickly become variable, depending on the implementation.
More concretely, the read DTR is theoretically n-1, as RAID-5 is
equivalent to RAID-0 when reading, minus one disk (the disk holding
the parity section of the block currently being read). Likewise, the
storage efficiency is also n-1 disks of usable capacity, as only one
additional disk (regardless of the total number of disks) is required
to maintain redundancy.
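A quick worked example makes the n-1 figures concrete; the per-disk
capacity and transfer rate below are assumed round numbers, not the
specs of any particular drive:

# Worked example of the theoretical RAID-5 figures above.
# Per-disk capacity and DTR are assumed round numbers for illustration.

n_disks  = 5        # total disks in the array
disk_gb  = 120      # capacity per disk (assumed)
disk_dtr = 50       # MB/s sustained per disk (assumed)

usable_gb     = (n_disks - 1) * disk_gb     # one disk's worth lost to parity
read_dtr_mb_s = (n_disks - 1) * disk_dtr    # reads skip the parity chunk

print("Usable capacity:      %d GB of %d GB raw" % (usable_gb, n_disks * disk_gb))
print("Theoretical read DTR: %d MB/s vs. %d MB/s for a single disk"
      % (read_dtr_mb_s, disk_dtr))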
Note that RAID-3 and RAID-4 striping with dedicated parity are
additional options that use a dedicated parity disk, instead of
striping parity across all disks. RAID-3 is offered by the Promise
SuperTrak series, and is ideal for workstations where access is
less random and in more sequential bursts. RAID-4 is not offered
by any major solutions listed here, but is highlighted in some enterprise
network attached storage solutions like Network Appliance's (NetApp's)
WAFL RAID-integrated filesystem. NetApp markets RAID-4 as better
than RAID-5 for servicing NFS with large block sizes and large files,
which are written in sequential bursts.
Software RAID-5, including FRAID, should be a non-consideration
for systems where writes must occur, especially on system partitions
like those for temporary files, logs, or swap space. Unless the system
is reading only from a RAID-5 array, and thus acting like little
more than a RAID-0 array (minus one disk), the strain on the system
interconnect with software RAID-5 results in a performance hit to
average write DTR that is far worse than with software RAID-1.
Besides the actual committal of data to disk, the following process
must occur for the parity in software RAID-5, as illustrated in
Figure 2:
1. CPU reads data from memory.
2. CPU calculates parity (XOR).
3. CPU writes parity to memory.
4. Memory commits parity to I/O controller.
5. I/O controller commits parity to disk.
Although the XOR operation used to compute the parity segment
of a RAID-5 volume is one of the fastest combinational logic circuits,
the process still must occur inline with the committal of data.
On top of this delay, we have the read and write latency of memory in
the first three steps. This greatly increases the latency of the write,
resulting in a DTR that can be two- or even threefold slower than JBOD.
At the same time, the entire system interconnect is tied up, resulting
in massive inefficiency in the use of the system's resources.
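The sketch below shows what steps 1 through 3 amount to: the parity
chunk is simply the byte-wise XOR of the data chunks in a stripe, and
the same XOR rebuilds a lost chunk during a rebuild. It is a conceptual
illustration, not the layout or code of any particular driver:

# Conceptual sketch of RAID-5 parity: the parity chunk is the byte-wise XOR
# of the data chunks in a stripe.  In software RAID-5, the CPU computes this
# in system memory (steps 1-3 above), touching every byte of the stripe.

def xor_parity(chunks):
    """Byte-wise XOR of equal-length chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return bytes(parity)

# A stripe across three data disks (chunks shrunk to 8 bytes for readability):
d0, d1, d2 = b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"
p = xor_parity([d0, d1, d2])

# If the disk holding d1 dies, its chunk is rebuilt from the survivors + parity:
rebuilt = xor_parity([d0, d2, p])
assert rebuilt == d1
print("parity:", p.hex(), "  rebuilt d1:", rebuilt)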
For example, RAIDCore markets its RAID-5 READ performance on its
4-8 channel (4-8 disk) FRAID cards as a better solution than Ultra320.
Indeed, the read performance of RAIDCore solutions is outstanding
and unparalleled, as the RAID-5 array acts like a virtual RAID-0
array (minus one disk) when reading, while its PCI-X capable I/O
provides outstanding bandwidth. But once the RAIDCore moves to RAID-5
writes or RAID-5 rebuilds, the tax on the system is no different from
any other FRAID solution, because it is simply "dumb" ATA. Benchmarks
often do not show how much your system cannot accomplish while it
is busy with unnecessary and redundant storage data transfers, especially
the XOR operations that incur massive memory traffic.
Intelligent hardware RAID is required for decent RAID-5 performance
when frequent writes are the norm. RAID-5 is where non-blocking
I/O is less of a consideration, as the XOR operation will require
extensive buffering of write operations. If the number of writes
is excessive and random, a large buffer will be required. The small
cache sizes (e.g., 1-4MB) of ASIC+SRAM solutions are not sufficient,
and a traditional microcontroller solution with a large buffer (e.g.,
64MB+) should be used.
RAID-0+1 and 0+5
The highest performing RAID-0+1 (aka RAID-10) solution is the
use of two (or more) intelligent hardware RAID controllers, each
with its own RAID-1 (or, better yet, striped RAID-0+1) volumes.
They should be placed on separate I/O buses and organized by the OS
into a single, RAID-0 volume under its LVM. The write overhead for
the RAID-1 portion is equivalent to JBOD from the system's interconnect
(off-loaded to the controller), while the write DTR is doubled (quadrupled
or more if the hardware volumes are RAID-0+1). Read DTR continues
to be "n" (increasing linearly with the total number of disks,
regardless of whether RAID-1 or RAID-0+1 is used at the hardware level),
and redundancy is maintained on both the individual hardware volumes
and the collective volume.
The same approach can also be used for RAID-0+5 (aka RAID-50),
which offers more efficiency (less cost for the same amount of usable
storage) than RAID-0+1. The redundancy continues to be the responsibility
of the hardware RAID, removing the overhead of XOR operations and
preventing the greatly increased usage of the system interconnect.
Although RAID-0+5 write DTR will still vary, and will likely be
slower than RAID-0+1, the striping across two (or more) RAID-5 volumes
helps reduce the impact.
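A small sketch comparing the two nested layouts built from the same
disks shows the efficiency trade-off described above; the per-disk
capacity is an assumed round number, and real write DTR depends heavily
on the implementation:

# Compare usable capacity of RAID-0+1 vs. RAID-0+5 built from the same disks,
# with the OS striping (RAID-0) across two hardware controllers.
# Per-disk capacity is an assumed round number.

n_controllers          = 2
n_disks_per_controller = 6
disk_gb                = 120    # assumed capacity per disk

total_gb = n_controllers * n_disks_per_controller * disk_gb

# RAID-0+1: each controller mirrors its disks, so half the raw capacity is usable.
raid01_usable = total_gb // 2

# RAID-0+5: each controller gives up one disk's worth of capacity to parity.
raid05_usable = total_gb - n_controllers * disk_gb

print("Raw capacity:     %4d GB" % total_gb)
print("RAID-0+1 usable:  %4d GB" % raid01_usable)
print("RAID-0+5 usable:  %4d GB" % raid05_usable)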
Product Analysis
Although FRAID is becoming commonplace on typical ATA mainboards,
it is usually not worth the OS configuration effort.
The operating system's inherent LVM capabilities for RAID-0 and
RAID-1 should be explored instead of FRAID. However, the ATA channels
of FRAID controllers can and should be utilized as "dumb" ATA channels
(i.e., don't use their integrated BIOS RAID setup). RAIDCore's 4-8
channel FRAID products integrate with the LVM of the underlying
platform (including Linux's LVM), so they are an ideal PCI-X hardware
selection when implementing a large number of disks in a software
RAID-0 or largely read-only RAID-5 solution (instead of purchasing
multiple, 2-channel, PCI ATA cards).
Intelligent ATA RAID leaves two remaining approaches:
- Adaptec 2400/2800 and Promise SuperTrak
- 3Ware Escalade
Both the Adaptec 2400/2800 and Promise SuperTrak use the 32-bit Intel
i960 µC series. The i960 legacy microcontroller is a woefully
underpowered solution for modern RAID, especially at the speeds Promise
is using (only the entry-level 33MHz on many SuperTrak models). Most
SCSI equivalents have moved to Intel's StrongARM or its XScale series
successor, which are far better for high-end solutions if cost is
no object. Still, the Adaptec 2400/2800 offers 4- to 8-channel (4-8
disk) solutions for RAID-5 -- far better than using software RAID-5.
On the non-blocking I/O side, there are 3Ware's Escalade "Storage
Switch" solutions -- the 7000 series for [parallel] ATA, and the
8000 series for SerialATA. Their 64-bit, non-blocking I/O PCI-ASIC
interconnect and 1-4MB of 0 wait state SRAM provide the fastest
RAID-0, RAID-1, and read-only RAID-5 performance, while completely
off-loading operations from the system interconnect with support
for up to 12 channels (12 disks) on some models. The disadvantage
of the Escalade is its small 1-4MB SRAM cache. This cache will quickly
overflow when extensive writes are made to a RAID-5 volume, effectively
killing the Escalade's average write DTR.
Linux Compatibility
The biggest problem with FRAID cards is their reliance on the
operating system driver. Without the OS driver, the volume's organization
cannot be understood. The OS will see the devices as dumb ATA devices,
possibly corrupting any RAID volume if a driver is not loaded. In
the case of Linux, GPL drivers from the vendors are impossible,
because licensed, third-party code cannot be open sourced. The Linux
community has come up with its own, generic FRAID driver (ataraid.c),
which is then used by the mini-drivers for common FRAID products,
but use of these drivers is not recommended. They may not be compatible
with newer FRAID products or volumes shared with other operating
systems.
Unlike FRAID, intelligent ATA RAID puts the RAID logic in the
firmware of the controller, which is then executed by the on-board
µC or ASIC. Promise has released GPL Linux drivers for the
SuperTrak and continues to support the SuperTrak on Linux. The Adaptec
2400/2800 is an I2O-compliant card, so most admins have had success
in using the DPT I2O driver (dpt_i2o.c). 3Ware maintains its own
GPL driver (3w-xxxx.c), and it has been included in the stock Linux
kernel since version 2.2.15. As with any intelligent storage controller,
the 3Ware Escalade product's firmware, OS driver, and user-space
tool versions should always be matched as listed in their release
notes. Support of these commodity ATA RAID solutions on platforms
other than Windows or Linux is typically limited.
Conclusions
Ultimately, choosing between ASIC and traditional microcontroller
ATA RAID solutions rests with the perceived cost of devices. The
final question to ask may be whether the use of cheaper ATA devices
negates the capacity waste of RAID-0+1 versus RAID-5, given RAID-0+1's
added benefit of being faster at writes in most applications.
Systems administrators may choose the ASIC route if they believe RAID-0+1
is the better approach, for little extra cost. 3Ware seems to be the
sole provider of this type of solution. At the high end, two 8-
or 12-channel Escalade cards, each with RAID-0+1 volumes on dedicated
PCI buses, provide an ideal, interleaved setup for using the operating
system's LVM to RAID-0 stripe them into a combined volume. If the application
is largely read-only, RAID-0+5 is also a more cost-efficient option
that will work well with 3Ware products. I've used a combination
of the two in the past, including RAID-0+1 and RAID-5 volumes on
the same 8- or 12-channel controller, plus RAID-0 striped over two
controllers.
An admin also may prefer RAID-5's disk efficiency, especially
with a large device count (dozens). When the number of writes is
extensive, the Adaptec 2400 becomes the best, entry-level ATA RAID-5
solution. Higher-capacity or higher-performing setups should include
a multi-channel SCSI RAID solution with lots of DRAM (e.g., 256MB+)
and a fast microcontroller (e.g., Intel StrongARM or XScale series),
as the Adaptec 2400/2800's i960 is quickly outclassed. As with RAID-1,
OS LVM RAID-0 stripes across multiple RAID-5 volumes over multiple
controllers across multiple I/O buses will result in the best average
DTR.
I hope this article will help you make informed choices for your
application's storage needs. Its concepts should be readily applicable
to analyzing newer ATA RAID solutions as they become available.
Bryan J. Smith holds a BSCpE from UCF, 29 (and holding) IT/vendor
certifications, and more than 12 years of combined IT/engineering
experience. His storage expertise encompasses high-performance disk
arrays for engineering and financial applications on NetApp filers
as well as Solaris, Windows, and Linux servers. As of 2004, Bryan
enjoys engaging the biggest technology critics of all -- middle
school students -- as a math and science teacher. He lives near
Orlando with his wife, Lourdes.