Dissecting ATA RAID Options
Bryan J. Smith
Advanced Technology Attachment (ATA) is a basic, block transfer
interface. In recent ATA specifications (e.g., the UltraDMA modes
for parallel ATA or newer Serial ATA), this is little more than
a copy of a fixed range of data from primary storage (volatile memory)
to secondary storage (non-volatile ATA disk), or vice versa, using
direct memory access (DMA). There is minimal system processor involvement
in the transfer because the ATA controller arbitrates the DMA transfer
on the I/O bus (e.g., PCI) and the integrated device electronics
(IDE) of the ATA drive to copy data to and from system memory.
In a nutshell, it's fast, efficient, and direct, using a point-to-point
transfer between two fixed endpoints (assuming there is no "master/slave"
setup in the case of parallel ATA). This is called non-blocking
I/O, and it gives ATA the highest throughput, or data transfer
rate (DTR), for desktops and workstations, with reduced latency
compared with command-queued, multiple-target I/O approaches like SCSI.
Multiple Devices
Things start to get complicated when devices, or targets, are
added to the storage solution, whether for purposes of capacity,
performance, or redundancy. With ATA, adding more devices means
adding more points of attachment on the I/O bus, such as multiple
ATA channels spread over multiple PCI slots or separate buses. The
downside is potential I/O contention between the I/O controllers
(especially when on the same I/O bus). SCSI, on the other hand,
supports multiple target attachments per I/O controller, so the point of contention
shifts to the SCSI bus, away from the I/O bus.
Storage experts are still debating which approach is better --
the traditional SCSI approach of multiple targets per bus, or multiple
point-to-point ATA channels. Fortunately for most high-end systems,
there are ATA storage solutions that offer more than simply two
ATA channels per I/O controller. Some offer on-board processing
and memory for off-loading operations as well as command queuing
and tagging when it is beneficial. These ATA solutions are more
on par with traditional SCSI I/O approaches than merely "dumb" ATA
controllers. Ultimately, as I will explain, it is the combination
of storage array design and the applications' needs that should
define the storage solution chosen.
ATA RAID Approaches
This article assumes you are familiar with the general concept
of Redundant Arrays of Independent Disks (RAID) and Just a Bunch
Of Disks (JBOD). Although there are endless ATA RAID products and
solutions on the market, they all boil down to four distinct approaches
(see Table 1), which I will describe in detail.
Logical Volume Management (LVM) -- The underlying logical volume
management of Windows NT (2000/XP/2003), Linux's LVM, Sun's
Solstice DiskSuite, and other solutions, possibly as an add-on (e.g.,
Veritas or IBM Tivoli products). Depending on the implementation,
the system may still be bootable from the firmware (e.g., PC BIOS)
without additional configuration (e.g., a copy of the OS bootstrap
with the required LVM drivers is on each disk in the array).
Free RAID (FRAID) Controller -- A standard ATA controller with
a 16-bit PC BIOS that adds basic RAID organization options in its
firmware, making the disks look like a single volume at boot-time
(before the OS loads). The disks are still accessed directly and
independently from the standpoint of system I/O. Once a 32/64-bit
OS loads, a driver is required to "organize" the disks into a volume
the OS can understand. This driver is where all RAID functionality
is located, so it is still a software RAID solution, driven by the
OS and, more importantly, the CPU of the system (not the controller).
FRAID is the most commonplace solution, because it is a "free" way
for a PC OEM to add RAID to any ATA controller -- both on the mainboard
and on PCI add-on cards.
Buffering RAID Controller -- The traditional approach used by
SCSI RAID solutions. An on-board microcontroller (µC) controls all
transfers with the system I/O; a µC is like a microprocessor, but it
is typically not super-pipelined (Pentium class or greater) and is
designed for I/O, with integrated peripherals or capabilities. RAID
volumes are addressed as a single (or multiple, as organized in
firmware) block storage device, and there is no means for the system
I/O to directly access the drives. An ample supply of on-board,
[Synchronous] Dynamic RAM ([S]DRAM) provides buffered I/O, allowing
operations to be extensively queued. There is an embedded OS on
the board, which is dedicated to RAID functionality.
Non-Blocking RAID Switch -- An ATA-centric approach used by at
least one major vendor. An on-board, application-specific integrated
circuit (ASIC) controls all transfers with the system I/O, as with a
buffering controller, but it is hardwired for its specific application.
Data passes through the ASIC without delay, other than some basic
command queuing for organization. This non-buffered approach to
I/O requires true 0 wait state memory, such as Static RAM (SRAM).
The trade-off is that an SRAM cell is built from several transistors
of latching logic, not the simple one-transistor-plus-capacitor cell
of DRAM, so it is much larger in silicon area and far more expensive per byte.
Note that Synchronous DRAM (SDRAM) is still a DRAM technology with
40ns+ latencies. The synchronous clock merely limits the impact of
these latencies for writes (e.g., it may appear to the system processor
that the data is committed within a few clock cycles, but it actually
takes dozens of cycles before the DRAM cell is written). For DRAM
reads, there is no way to avoid waiting those dozens of cycles,
resulting in massive latency and ultimately dragging down the average
DTR over time. DRAM latency is typically mitigated in a modern system
by placing a small amount of 0 wait state RAM near the CPU; that is
why SRAM is embedded in the modern processor (L1 and L2 cache) to
keep immediate data readable.
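To make the effect concrete, here is a rough, back-of-the-envelope
sketch; the latency and burst figures are assumed round numbers for
illustration, not measurements of any particular memory part:

# Rough illustration of why access latency drags down average DTR,
# even when the peak (burst) rate is high.  All figures are assumed
# round numbers for the sake of the example.

def average_dtr(burst_mb_per_s, access_latency_ns, burst_bytes):
    """Average rate when every burst pays an up-front access latency."""
    transfer_ns = burst_bytes / (burst_mb_per_s * 1e6) * 1e9
    total_ns = access_latency_ns + transfer_ns
    return burst_bytes / (total_ns * 1e-9) / 1e6   # MB/s

# 64-byte bursts at a 1600 MB/s peak rate:
print("DRAM-like, 40 ns latency: %4.0f MB/s" % average_dtr(1600, 40, 64))
print("0 wait state,  0 ns:      %4.0f MB/s" % average_dtr(1600, 0, 64))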
The difference between LVM and FRAID can be confusing. However,
FRAID is no different from a "regular" ATA controller, except for
some added logic in its on-board 16-bit PC BIOS. The more advanced,
32/64-bit RAID logic required after the OS loads is in the driver,
and executes on the processor, just like an OS LVM implementation.
The data transfer is still point-to-point between the individual
drive and system memory, with the associated overhead (e.g., second
copy of data for RAID-1/mirroring, discussed later) for the RAID
operations on the system's I/O. In some cases, the CPU is also utilized
for FRAID operations (e.g., XOR operations for RAID-5, also discussed
later).
In "intelligent" hardware RAID, on-board controller and RAM can
off-load the burden of RAID overhead from the system's processor
and interconnect. The data transfer occurs between the system memory
and on-board controller, in a single data stream regardless of the
RAID level of the volume. The on-board controller handles the point-to-point
connections to the drives, independent of the rest of the system.
In the case of the µC+DRAM approach, the data writes are always
buffered in the local, on-board RAM first. In the case of the ASIC+SRAM
approach, an "adaptive cut-through" approach (directly to the disks,
if the disks are not busy) can be offered alongside the traditional
"store and forward" buffering approach.
Note that the use of common OSI layer-2/IEEE802 switch product
terminology here is deliberate because Ethernet switches use the
same ASIC+SRAM design as "Storage switches." The only difference
is the use of PHY interface chips for the Ethernet media channel
access instead of ATA controllers for drive channels.
RAID-0
As shown in Table 2, RAID-0 striping can theoretically offer up
to n times (where n equals the number of disks) the data transfer
rate compared with JBOD -- a linear increase in transfer rate as
each disk is added, for both reads and writes. At the same time,
the amount of data actually transferred to/from disk versus data committed
by the OS is 1:1 for both writes and reads, the same as JBOD. So, there is no
overhead disadvantage except for the lack of redundancy, including
destruction of the entire volume if any single disk is lost.
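As a concrete illustration of why the theoretical DTR scales with n,
the sketch below shows how a striped volume spreads consecutive chunks
round-robin across its member disks, so a long sequential transfer keeps
every spindle busy at once. The chunk size and mapping are hypothetical,
not any particular controller's on-disk format:

# Minimal sketch of RAID-0 striping: logical blocks map round-robin across
# n disks in fixed-size chunks.  The chunk size and layout are illustrative.

def raid0_map(logical_block, n_disks, chunk_blocks=64):
    """Return (disk index, physical block on that disk) for a logical block."""
    chunk = logical_block // chunk_blocks       # which chunk of the volume
    offset = logical_block % chunk_blocks       # offset within that chunk
    disk = chunk % n_disks                      # chunks rotate across disks
    physical_chunk = chunk // n_disks           # chunk position on that disk
    return disk, physical_chunk * chunk_blocks + offset

# A sequential read walks every disk in turn, which is why the theoretical
# DTR approaches n times that of a single disk:
for lb in range(0, 5 * 64, 64):
    print("logical block %4d -> disk %d, block %d" % ((lb,) + raid0_map(lb, 4)))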
RAID-0 is ideally suited to software, rather than a single add-on
card solution. The LVM in most OSes and software RAID products offers
expanded volume management over most hardware RAID-0 cards. From a
performance standpoint, using two controllers on two
separate I/O buses and interleaving data between controllers in
a RAID-0 striping volume is best. With little more than a small
amount of command overhead, the additional system processor load
in software RAID-0 is negligible. (Note that separate I/O buses
mean separate PCI buses on a typical PC mainboard, not merely different
PCI slots. Current PC chipsets sporting multiple PCI buses include
AMD's 760MP series (Athlon MP) and Intel/ServerWorks' E7500/ServerSet
series (Xeon-SMP). AMD's new 64-bit platforms with HyperTransport
offer multiple PCI-X bus options and Intel's forthcoming PCI-Express
standard will as well.)
Given that most OS LVM implementations for RAID-0 on a system
volume are completely bootable -- negating the BIOS advantage offered
by a FRAID card -- LVM becomes the better software solution because
the OS knows how to organize its own commits directly to disks,
rather than going through a vendor's driver logic.
RAID-0 is also the first place to dismiss the option of going
with a buffered RAID controller for ATA. The interleaved operation
of RAID-0 is as near to a non-blocking transfer as possible. The
buffering of writes and, especially, reads, is a redundant operation
that the OS is more capable of doing in its own system memory for
ATA storage. Buffered RAID controllers for RAID-0 should be limited
to solutions where dozens of disks are required, and those are better
implemented with a very high-end SCSI solution (e.g., a StrongARM/XScale
microcontroller with three or more Ultra320 channels). For fewer
disks, a non-blocking RAID switch becomes the ideal hardware
approach. Like LVM, it will interleave transfers between its disks
and offer command-queuing and buffering as appropriate (largely
for reads), thereby reducing processor overhead.
RAID-1
As seen in Table 2, RAID-1 mirroring is where write overhead, the cost
of redundancy and fault tolerance, begins to affect software approaches.
There is essentially a 2:1 write overhead; in other words,
twice as much data is committed to disk as is written compared with
JBOD. Reads remain 1:1.
As illustrated in Figure 1, software RAID-1 requires twice as
much data transfer between the memory and I/O for write operations,
effectively tying up the system interconnect for twice as long while
offering no improved performance. This still occurs with the majority
of FRAID cards. To make matters worse, FRAID card performance can
really suffer if a "slave" device is added to a single (parallel)
ATA channel where there is an existing "master." Regardless of approach,
ATA performance requires a single device per channel for point-to-point,
non-blocking I/O. The SerialATA specification smartly avoids offering
even the option of a "slave".
RAID-1 is where intelligent hardware RAID begins to show its benefit.
Unlike software RAID, where two mirrored I/O operations must occur, a
single data transfer occurs between system memory and the RAID controller.
The write overhead of the mirroring is essentially moved to the controller,
so the overhead becomes 1:1 versus JBOD from the standpoint of the system.
Because mirrors are simply two copies of the same data, the on-board
controllers of intelligent ATA RAID solutions are smart enough to
interleave multiple reads from different disks. Thus, they reduce
disk-seek latency times and theoretically increase read DTR up to 2:1
over JBOD. Some LVM implementations, and almost no FRAID cards,
offer this performance enhancement. Like RAID-0, RAID-1 is also
as near to a non-blocking I/O operation as possible -- especially
for interleaved reads.
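The read-interleaving idea can be sketched as follows; the shortest-queue
policy shown is just one plausible heuristic a controller or LVM might
use, not a description of any particular product's firmware:

# Sketch of RAID-1 read interleaving: with two identical copies of the data,
# each read can go to whichever mirror is least busy, roughly doubling
# theoretical read DTR.  Writes must still go to every copy.

class Raid1Mirror:
    def __init__(self, n_copies=2):
        self.queues = [0] * n_copies            # outstanding requests per disk

    def submit_read(self, block):
        # Pick the mirror with the shortest queue (one plausible heuristic).
        disk = min(range(len(self.queues)), key=lambda d: self.queues[d])
        self.queues[disk] += 1
        return disk

    def submit_write(self, block):
        # Every copy is written -- the 2:1 write overhead of mirroring.
        for d in range(len(self.queues)):
            self.queues[d] += 1
        return list(range(len(self.queues)))

mirror = Raid1Mirror()
print([mirror.submit_read(b) for b in range(6)])    # reads alternate 0,1,0,1,...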
RAID-5
RAID-5 striping with striped parity is where the write efficiency
and overhead quickly become variable, depending on the implementation.
More concretely, the read DTR is theoretically n-1, as RAID-5 is
equivalent to RAID-0 when reading, minus one disk (the disk holding
the parity section of the block currently being read). Likewise, the
storage efficiency is also n-1 disks of usable capacity, as only one
additional disk (regardless of the total number of disks) is required
to maintain redundancy.
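A quick worked example makes the n-1 figures concrete; the per-disk
capacity and transfer rate below are assumed round numbers, not the
specs of any particular drive:

# Worked example of the theoretical RAID-5 figures above.
# Per-disk capacity and DTR are assumed round numbers for illustration.

n_disks  = 5        # total disks in the array
disk_gb  = 120      # capacity per disk (assumed)
disk_dtr = 50       # MB/s sustained per disk (assumed)

usable_gb     = (n_disks - 1) * disk_gb     # one disk's worth lost to parity
read_dtr_mb_s = (n_disks - 1) * disk_dtr    # reads skip the parity chunk

print("Usable capacity:      %d GB of %d GB raw" % (usable_gb, n_disks * disk_gb))
print("Theoretical read DTR: %d MB/s vs. %d MB/s for a single disk"
      % (read_dtr_mb_s, disk_dtr))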
Note that RAID-3 and RAID-4 striping with dedicated parity are
additional options that use a dedicated parity disk, instead of
striping parity across all disks. RAID-3 is offered by the Promise
SuperTrak series, and is ideal for workstations where access is
less random and in more sequential bursts. RAID-4 is not offered
by any major solutions listed here, but is highlighted in some enterprise
network attached storage solutions like Network Appliance's (NetApp's)
WAFL RAID-integrated filesystem. NetApp markets RAID-4 as better
than RAID-5 for servicing NFS with large block sizes and large files,
which are written in sequential bursts.
Software RAID-5, including FRAID, should be a non-consideration
for systems where writes must occur, especially on system partitions
like those for temporary files, logs, or swap space. Unless the system
is reading only from a RAID-5 array, and thus acting like little
more than a RAID-0 array (minus one disk), the strain on the system
interconnect with software RAID-5 results in a performance hit to
average write DTR that is far worse than with software RAID-1.
Besides the actual committal of data to disk, the following process
must occur for the parity in software RAID-5, as illustrated in
Figure 2:
1. CPU reads data from memory.
2. CPU calculates parity (XOR).
3. CPU writes parity to memory.
4. Memory commits parity to I/O controller.
5. I/O controller commits parity to disk.
Although the XOR operation used to compute the parity segment
of a RAID-5 volume is one of the fastest combinational logic circuits,
the process still must occur inline with the committal of data.
On top of this delay, we have the read and write latency of memory in
the first three steps. This greatly increases the latency of the write,
resulting in a DTR that can be two- or even threefold slower than JBOD.
At the same time, the entire system interconnect is tied up, resulting
in massive inefficiency in the use of the system's resources.
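The sketch below shows what steps 1 through 3 amount to: the parity
chunk is simply the byte-wise XOR of the data chunks in a stripe, and
the same XOR rebuilds a lost chunk during a rebuild. It is a conceptual
illustration, not the layout or code of any particular driver:

# Conceptual sketch of RAID-5 parity: the parity chunk is the byte-wise XOR
# of the data chunks in a stripe.  In software RAID-5, the CPU computes this
# in system memory (steps 1-3 above), touching every byte of the stripe.

def xor_parity(chunks):
    """Byte-wise XOR of equal-length chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return bytes(parity)

# A stripe across three data disks (chunks shrunk to 8 bytes for readability):
d0, d1, d2 = b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"
p = xor_parity([d0, d1, d2])

# If the disk holding d1 dies, its chunk is rebuilt from the survivors + parity:
rebuilt = xor_parity([d0, d2, p])
assert rebuilt == d1
print("parity:", p.hex(), "  rebuilt d1:", rebuilt)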
For example, RAIDCore markets its RAID-5 READ performance on its
4-8 channel (4-8 disk) FRAID cards as a better solution than Ultra320.
Indeed, the read performance of RAIDCore solutions is outstanding
and unparalleled, as the RAID-5 array acts like a virtual RAID-0
array (minus one disk) when reading, while its PCI-X capable I/O
provides outstanding bandwidth. But once the RAIDCore moves to RAID-5
writes or RAID-5 rebuilds, the tax on the system is no different from
any other FRAID solution, because it is simply "dumb" ATA. Benchmarks
often do not show how much your system cannot accomplish while it
is busy with unnecessary and redundant storage data transfers, especially
the XOR operations that incur massive memory traffic.
Intelligent hardware RAID is required for decent RAID-5 performance
when frequent writes are the norm. RAID-5 is where non-blocking
I/O is less of a consideration, as the XOR operation will require
extensive buffering of write operations. If the number of writes
is excessive and random, a large buffer will be required. The small
cache sizes (e.g., 1-4MB) of ASIC+SRAM solutions are not sufficient,
and a traditional microcontroller solution with a large buffer (e.g.,
64MB+) should be used.
RAID-0+1 and 0+5
The highest performing RAID-0+1 (aka RAID-10) solution is the
use of two (or more) intelligent hardware RAID controllers, each
with its own RAID-1 (or, better yet, striped RAID-0+1) volumes.
They should be placed on separate I/O buses and organized by the OS
into a single, RAID-0 volume under its LVM. The write overhead for
the RAID-1 portion is equivalent to JBOD from the system's interconnect
(off-loaded to the controller), while the write DTR is doubled (quadrupled
or more if the hardware volumes are RAID-0+1). Read DTR continues
to be "n" (increasing linearly with the total number of disks,
regardless of whether RAID-1 or RAID-0+1 is used at the hardware level),
and redundancy is maintained on both the individual hardware volumes
and the collective volume.
The same approach can also be used for RAID-0+5 (aka RAID-50),
which offers more efficiency (less cost for the same amount of usable
storage) than RAID-0+1. The redundancy continues to be the responsibility
of the hardware RAID, removing the overhead of XOR operations and
preventing the greatly increased usage of the system interconnect.
Although RAID-0+5 write DTR will still vary, and will likely be
slower than RAID-0+1, the striping across two (or more) RAID-5 volumes
helps reduce the impact.
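A small sketch comparing the two nested layouts built from the same
disks shows the efficiency trade-off described above; the per-disk
capacity is an assumed round number, and real write DTR depends heavily
on the implementation:

# Compare usable capacity of RAID-0+1 vs. RAID-0+5 built from the same disks,
# with the OS striping (RAID-0) across two hardware controllers.
# Per-disk capacity is an assumed round number.

n_controllers          = 2
n_disks_per_controller = 6
disk_gb                = 120    # assumed capacity per disk

total_gb = n_controllers * n_disks_per_controller * disk_gb

# RAID-0+1: each controller mirrors its disks, so half the raw capacity is usable.
raid01_usable = total_gb // 2

# RAID-0+5: each controller gives up one disk's worth of capacity to parity.
raid05_usable = total_gb - n_controllers * disk_gb

print("Raw capacity:     %4d GB" % total_gb)
print("RAID-0+1 usable:  %4d GB" % raid01_usable)
print("RAID-0+5 usable:  %4d GB" % raid05_usable)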
Product Analysis
Although FRAID is becoming commonplace on typical ATA mainboards,
it is usually not worth the OS configuration effort.
The operating system's inherent LVM capabilities for RAID-0 and
RAID-1 should be explored instead of FRAID. However, the ATA channels
of FRAID controllers can and should be utilized as "dumb" ATA channels
(i.e., don't use their integrated BIOS RAID setup). RAIDCore's 4-8
channel FRAID products integrate with the LVM of the underlying
platform (including Linux's LVM), so they are an ideal PCI-X hardware
selection when implementing a large number of disks in a software
RAID-0 or largely read-only RAID-5 solution (instead of purchasing
multiple, 2-channel, PCI ATA cards).
Intelligent ATA RAID leaves two remaining approaches:
- Adaptec 2400/2800 and Promise SuperTrak
- 3Ware Escalade
Both the Adaptec 2400/2800 and Promise SuperTrak use the 32-bit Intel
i960 µC series. The i960 legacy microcontroller is a woefully
underpowered solution for modern RAID, especially at the speeds Promise
is using (only the entry-level 33MHz on many SuperTrak models). Most
SCSI equivalents have moved to Intel's StrongARM or its XScale series
successor, which are far better for high-end solutions if cost is
no object. Still, the Adaptec 2400/2800 offers 4- to 8-channel (4-8
disk) solutions for RAID-5 -- far better than using software RAID-5.
On the non-blocking I/O side, there are 3Ware's Escalade "Storage
Switch" solutions -- the 7000 series for [parallel] ATA, and the
8000 series for SerialATA. Their 64-bit, non-blocking I/O PCI-ASIC
interconnect and 1-4MB of 0 wait state SRAM provide the fastest
RAID-0, RAID-1, and read-only RAID-5 performance, while completely
off-loading operations from the system interconnect with support
for up to 12 channels (12 disks) on some models. The disadvantage
of the Escalade is its small 1-4MB SRAM cache. This cache will quickly
overflow when extensive writes are made to a RAID-5 volume, effectively
killing the Escalade's average write DTR.
Linux Compatibility
The biggest problem with FRAID cards is their reliance on the
operating system driver. Without the OS driver, the volume's organization
cannot be understood. The OS will see the devices as dumb ATA devices,
possibly corrupting any RAID volume if a driver is not loaded. In
the case of Linux, GPL drivers from the vendors are impossible,
because licensed, third-party code cannot be open sourced. The Linux
community has come up with its own, generic FRAID driver (ataraid.c),
which is then used by the mini-drivers for common FRAID products,
but use of these drivers is not recommended. They may not be compatible
with newer FRAID products or volumes shared with other operating
systems.
Unlike FRAID, intelligent ATA RAID puts the RAID logic in the
firmware of the controller, which is then executed by the on-board
µC or ASIC. Promise has released GPL Linux drivers for the
SuperTrak and continues to support the SuperTrak on Linux. The Adaptec
2400/2800 is an I2O-compliant card, so most admins have had success
in using the DPT I2O driver (dpt_i2o.c). 3Ware maintains its own
GPL driver (3w-xxxx.c), and it has been included in the stock Linux
kernel since version 2.2.15. As with any intelligent storage controller,
the 3Ware Escalade product's firmware, OS driver, and user-space
tool versions should always be matched as listed in their release
notes. Support of these commodity ATA RAID solutions on platforms
other than Windows or Linux is typically limited.
Conclusions
Ultimately, choosing between ASIC and traditional microcontroller
ATA RAID solutions rests with the perceived cost of devices. The
final question to ask may be whether the use of cheaper ATA devices
negates the capacity waste of RAID-0+1 versus RAID-5, given RAID-0+1's
added benefit of being faster at writes in most applications.
Systems administrators may choose the ASIC route if they believe RAID-0+1
is the better approach, for little extra cost. 3Ware seems to be the
sole provider of this type of solution. At the high end, two 8-
or 12-channel Escalade cards, each with RAID-0+1 volumes on dedicated
PCI buses, provide an ideal, interleaved setup for using the operating
system's LVM to RAID-0 stripe them into a combined volume. If the application
is largely read-only, RAID-0+5 is also a more cost-efficient option
that will work well with 3Ware products. I've used a combination
of the two in the past, including RAID-0+1 and RAID-5 volumes on
the same 8- or 12-channel controller, plus RAID-0 striped over two
controllers.
An admin also may prefer RAID-5's disk efficiency, especially
with a large device count (dozens). When the number of writes is
extensive, the Adaptec 2400 becomes the best, entry-level ATA RAID-5
solution. Higher-capacity or higher-performing setups should include
a multi-channel SCSI RAID solution with lots of DRAM (e.g., 256MB+)
and a fast microcontroller (e.g., Intel StrongARM or XScale series),
as the Adaptec 2400/2800's i960 is quickly outclassed. As with RAID-1,
OS LVM RAID-0 stripes across multiple RAID-5 volumes over multiple
controllers across multiple I/O buses will result in the best average
DTR.
I hope this article will help you make informed choices for your
application's storage needs. Its concepts should be readily applicable
to analyzing newer ATA RAID solutions as they become available.
Bryan J. Smith holds a BSCpE from UCF, 29 (and holding) IT/vendor
certifications, and more than 12 years of combined IT/engineering
experience. His storage expertise encompasses high-performance disk
arrays for engineering and financial applications on NetApp filers
as well as Solaris, Windows, and Linux servers. As of 2004, Bryan
enjoys engaging the biggest technology critics of all -- middle
school students -- as a math and science teacher. He lives near
Orlando with his wife, Lourdes.