Monitoring and Managing Linux Software RAID
Ryan Matteson
Systems administrators managing a data center face numerous challenges
to achieve required availability and uptime. Two of the main challenges
are shrinking budgets (for hardware, software, and staffing) and
short deadlines in which to deliver solutions. The Linux community
has developed kernel support for software RAID (Redundant Array
of Inexpensive Disks) to help meet those challenges. Software RAID,
properly implemented, can eliminate system downtime caused by disk
drive errors. The source code to the Linux kernel, the RAID modules,
and the raidtools package are available at minimal cost under the
GNU General Public License. The interface is well documented and comprehensible
to a moderately experienced Linux systems administrator.
In this article, I'll provide an overview of the software RAID
implementation in the Linux 2.4.X kernel. I will describe the creation
and activation of software RAID devices as well as the management
of active RAID devices. Finally, I will discuss some procedures
for recovering from a failed disk unit.
Introduction to RAID
RAID is a set of algorithms for writing data blocks to disk devices.
Each RAID mode, or level, specifies the layout of data blocks on
multiple disks. Each RAID mode provides an enhancement in one aspect
of data management: redundancy or reliability, read or write performance,
or logical unit capacity. Simple RAID modes are named with an integer
number: RAID 0, RAID 1, or RAID 5. Complex RAID modes that combine
multiple simple modes are named with a combined name: RAID 0+1,
RAID 1+0.
RAID 0, or striping, writes consecutive data blocks across multiple
disk devices. It is used to enhance the read/write performance of large
data sets and to increase logical unit capacity beyond the limits of a
single disk device. RAID 0 provides no redundancy: if any member disk
fails, the data on the entire device is lost.
RAID 1, or mirroring, is used to provide high reliability and
fault tolerance. In RAID 1, each data block is written to multiple
disk devices simultaneously. If one disk device were to fail, all
data could be recovered from one of the mirror disks. The cost of
RAID 1 data sets increases with the number of mirror sets used.
RAID 5, also referred to as striping with parity, distributes
data blocks and parity blocks across all devices in a RAID device.
Parity calculations are performed for each write operation, and
used to regenerate data if a disk failure is detected. Because data
blocks are striped across all member disks, read performance approaches
that of RAID 0. RAID 5 also makes efficient use of disk capacity: only
one disk's worth of space in the array is given up to parity, regardless
of how many member disks are used, so capacity can grow without losing
redundancy.
Complex RAID modes involve the combination of the benefits of
single RAID levels into the same logical unit. RAID 0+1 is the striping
or concatenation of multiple disk devices into a larger logical
unit with additional disk devices allocated to mirror the striped
devices. A 45-GB RAID 0+1 logical unit would require ten 9-GB disk
devices. Data blocks would be striped or concatenated across five of
the disk devices. Simultaneously, each data block would be written
to one of the other five disk devices to provide a mirror for
the entire logical unit. RAID 1+0 is the use of mirrored disk devices
to form larger striped or concatenated logical units. There is no
difference between the logical unit presented to the operating system
by RAID 0+1 and RAID 1+0 -- it is 45 GB either way, and the same number
of disk devices is required to implement either.
Current Linux kernels support RAID -- both hardware and software
-- for many disk devices and controllers. The distinction between
hardware RAID and software RAID lies in where the RAID mode is
implemented. Hardware RAID solutions require specialized hardware
(disk controllers, disk enclosures, and/or drives). Software RAID
is implemented by the operating system of the server to which the
devices are attached. The trade-off is between price and performance.
Most hardware RAID controllers have dedicated processing units and
non-volatile cache memory. The controller can acknowledge the write
as completed to the operating system immediately and perform the
physical writes to disk later, increasing performance and removing
processing load from the server. Some RAID hardware devices support
replacing physical disk units without taking the server offline.
Software RAID is an emulation of what hardware RAID devices do.
There are some disadvantages of software RAID compared to hardware
RAID: some disk device write performance is lost; there is additional
processing burden on the server; and the hot-swappability of disk
units is not available. However, the cost of standard disk controllers
and devices is much less than those that support RAID modes in hardware.
Often a combination of hardware RAID devices and software RAID will
provide a flexible and maintainable solution that fits within the
availability and budget constraints of the application.
Hardware RAID controllers allocate storage from a pool of available
disks into a logical unit of disk storage, which is presented to
the operating system. Most RAID controllers support RAID 0, RAID
1, RAID 5, RAID 0+1, and RAID 1+0. When RAID layouts are implemented
in software, the kernel is responsible for managing individual disk
units. The RAID drivers keep track of which disk units are assigned
to each logical unit and where to read or write the raw data. The
RAID logical unit is presented to the operating system as an abstract
disk, upon which any type of Linux disk filesystem (e.g., ext2,
ext3, reiserfs) can be installed. The filesystem interface is unaware
both that the "disk unit" is actually an array of multiple disks
and of how the data is laid out among them.
The remainder of this article will deal specifically with the
Linux RAID implementation in software. In Linux documentation, the
software RAID implementation is also referred to as MD (multiple
device). Many of the commands demonstrated are from the raidtools
package, which must be installed to manage RAID devices. The mdadm
package is also available to create, manage, and monitor MD devices.
This tool provides a variety of advanced features, but will not
be covered in this article. For additional information on mdadm,
please see the references.
A word of caution -- the examples in this article are from my
test RAID systems. Please study all relevant documentation and plan
carefully before adding or changing system parameters. I highly
recommend that everything be implemented and tested on a non-production
system before making changes to any live systems.
Linux Software RAID Implementation
The Linux kernel supports RAID 0, RAID 1, RAID 4, and RAID 5. RAID
devices can also be combined to implement RAID 1+0 or RAID 0+1 layouts
for additional availability or performance. The kernel also supports
the allocation of one or more hot spare disk units per RAID device.
A hot spare disk is one that is not used to store data or parity
blocks -- it is available to the RAID device for recovery if one
of the other disks comprising the device fails. When a disk fault
is detected, the operating system begins to rebuild the failed disk's
data onto the hot spare, using the surviving mirror or parity
information. The faulty disk drive can be replaced later.
Kernel Support
A recent version of the Linux kernel should be used to implement
software RAID. All examples in this article are from kernel version
2.4.20. Some Linux distributions include kernels with RAID support
precompiled into the kernel, and contain the raidtools package, which
is required to manage software RAID devices. Recent versions of the
Red Hat and SuSE distributions include kernels with built-in RAID
support, which can be configured at installation time, or once the
server is operational.
If you compile a new kernel, you must enable RAID support in your
kernel configuration. Include support for all RAID layouts (modes)
that will be used:
Multiple devices driver support (RAID and LVM) (CONFIG_MD) [Y/n/?] y
RAID support (CONFIG_BLK_DEV_MD) [M/n/y/?] y
Linear (append) mode (CONFIG_MD_LINEAR) [M/n/y/?] y
RAID-0 (striping) mode (CONFIG_MD_RAID0) [M/n/y/?] y
RAID-1 (mirroring) mode (CONFIG_MD_RAID1) [M/n/y/?] y
RAID-4/RAID-5 mode (CONFIG_MD_RAID5) [M/n/y/?] y
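If any of the RAID options above are built as modules (M) rather than
compiled into the kernel, the corresponding drivers can be loaded by
hand once the new kernel is running. The following is a minimal sketch,
assuming the standard 2.4 module names:
# Load the MD core driver and the RAID personalities built as modules.
$ modprobe md
$ modprobe raid1
$ modprobe raid5
# Verify that the personalities registered with the kernel.
$ cat /proc/mdstat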
Once the kernel has been configured with support for software RAID,
it must be compiled and installed, and the system should be booted
with the new kernel. The dmesg command can verify that software
RAID is enabled in the running kernel:
$ dmesg | grep ^md | head -3
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
The presence of md messages indicates that the kernel has loaded the
software RAID drivers and is attempting to detect RAID devices attached
to the server. During the autodetect process, the kernel reads the
partition identification tags of each disk partition available to
the system. Any partition that has a partition identification tag
of 0xFD is a RAID partition. The operating system will attempt to
start each RAID partition at autodetect time. The fdisk utility can
be used to view the partition tags of any disk device:
$ fdisk -l /dev/hda
Disk /dev/hda: 100.0 GB, 100030242816 bytes
255 heads, 63 sectors/track, 12161 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1       12095    97153056   fd  Linux raid autodetect
/dev/hda2           12096       12160      522112+  fd  Linux raid autodetect
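If a partition intended for software RAID does not yet carry the 0xFD
tag, it can be changed with fdisk's interactive t command before the
RAID device is created. A rough sketch of such a session (the prompts
vary slightly between fdisk versions, and the partition number is only
an example):
$ fdisk /dev/hdc

Command (m for help): t
Partition number (1-4): 1
Hex code (type L to list codes): fd
Changed system type of partition 1 to fd (Linux raid autodetect)

Command (m for help): w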
Software RAID Devices
A software RAID device consists of one or more disk partitions
(or other disk devices) that are combined to form a RAID device.
The devices are combined in one of the supported basic RAID modes:
RAID 0, RAID 1, RAID 4, or RAID 5. The configuration file that assigns
partitions and records other details of each RAID device is /etc/raidtab.
Each section of /etc/raidtab that defines a software RAID device
begins with the keyword raiddev:
raiddev /dev/md0
raid-level 1
nr-raid-disks 2
persistent-superblock 1
nr-spare-disks 0
device /dev/hda1
raid-disk 0
device /dev/hdc1
raid-disk 1
The device name is identified in the raiddev directive. The RAID mode
is defined in the raid-level directive. The nr-raid-disks directive
specifies how many disks (partitions or other disk devices) will be
used to comprise the RAID device. The persistent-superblock directive
instructs the kernel to write the RAID configuration data near the end
of each member partition, which allows the kernel to identify the RAID
configuration at boot time. The nr-spare-disks directive is used to
define hot-spare disks, described later. Following the general device
definitions,
each disk device that will comprise the RAID device is specified with
a device directive. The device specified can be a disk partition,
a disk device, or another RAID device. Each device directive can be
further specified by other directives. The raid-disk directive indicates
the relative position of each disk device within the RAID layout (from
0 to nr-raid-disks - 1). For striping layouts, such as RAID 5, this
indicates the column where that disk device will lie. For RAID 1 and
RAID 5, it is important to allocate disk units to each raid-disk that
have the same physical disk geometry (cylinders, sectors, tracks)
and are the same size.
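One way to guarantee identically sized member partitions for a mirror
is to copy the partition layout from the first disk to the second with
sfdisk. A minimal sketch, assuming /dev/hda already holds the desired
layout (note that this overwrites any existing partition table on
/dev/hdc):
# Dump the partition table of the first disk and apply it to the second.
$ sfdisk -d /dev/hda | sfdisk /dev/hdc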
RAID 5 devices are defined with the raid-level directive, and
a number 5 to specify the RAID mode. An example three-column RAID
device would be:
raiddev /dev/md1
raid-level 5
nr-raid-disks 3
nr-spare-disks 0
persistent-superblock 1
parity-algorithm left-symmetric
chunk-size 32
device /dev/hda1
raid-disk 0
device /dev/hdc1
raid-disk 1
device /dev/hde1
raid-disk 2
The chunk-size directive specifies how many kilobytes will be written
as a data block to each column of the RAID device. This is also the
size of the parity block that will be calculated for each data block.
The optimal value for chunk-size depends on the application and the
disk devices. The parity-algorithm directive specifies how parity
blocks will be calculated based on the data blocks, and how to organize
the location of the parity blocks across columns.
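When an ext2 or ext3 filesystem will be created on a striped device, it
can help to tell mke2fs the stripe geometry so that filesystem metadata
is spread evenly across the columns. A minimal sketch, assuming a 32-KB
chunk size and 4-KB filesystem blocks, using the -R stride option
accepted by mke2fs of this era:
# stride = chunk-size / filesystem block size = 32 KB / 4 KB = 8
$ mke2fs -j -b 4096 -R stride=8 /dev/md1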
Initializing a Software RAID Device
Each software RAID device that is defined in /etc/raidtab must
be initialized. The mkraid utility creates the device node,
initializes all the devices that comprise the RAID device, and starts
the RAID device. Be aware that initialization with mkraid
will destroy data that currently exists on any of the devices that
are specified in /etc/raidtab for that RAID device:
$ mkraid /dev/md0
The status of RAID device initialization can be monitored via /proc/mdstat.
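For example, the initial synchronization can be followed as it
progresses (assuming the watch utility is installed):
# Refresh the RAID status display every five seconds.
$ watch -n 5 cat /proc/mdstat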
When the devices have been initialized and started, filesystems
can be created on the RAID device:
$ mke2fs -j /dev/md0
$ mkdir /data/snowball
$ mount /dev/md0 /data/snowball
After a filesystem has been created on the RAID device, the RAID device
can be managed like any other Linux filesystem. To have a filesystem
mount upon system boot, add an entry in /etc/fstab:
/dev/md0 /data/snowball ext3 defaults 1 2
Once mounted as a filesystem, the RAID device is used like any other
filesystem on the server.
Stopping and Starting Software RAID Devices
Under normal circumstances, software RAID devices are started
during the autodetect process. Some events may require that the
RAID devices be stopped while the server is still running. Before
stopping a software RAID device, the kernel forces any buffered
data to be written to the RAID device. After all buffered data has
been written, the RAID device is stopped. The data on a stopped
RAID device is not accessible to the operating system. The underlying
disk devices that comprise the RAID device remain accessible, although
they should not be modified or the RAID device will be corrupted.
The raidstop utility is used to stop a RAID device. Mounted
filesystems on the RAID device must be unmounted before stopping
the underlying device:
$ umount /data/snowball
$ raidstop /dev/md0
A stopped software RAID device must be explicitly started before it
can be used by the server. The raidstart utility is used to
start a RAID device. Once started, the filesystem can be mounted:
$ raidstart /dev/md0
$ mount /dev/md0 /data/snowball
Hot Spare Disk Devices
The Linux software RAID implementation supports one or more hot
spare devices to be assigned to a RAID device. A hot spare device
is a disk device that is available to a RAID device to replace one
of the component disk devices in case of a disk fault or failure.
Spare disks enable the RAID device to continue to operate, and begin
recovery procedures, in real-time. When the kernel receives notice
of a failed disk device that is part of a RAID device, the RAID
device is checked for an available hot spare device. If a spare
disk is available, the kernel will logically replace the faulted
disk with the spare in the RAID device.
The redundant RAID modes (RAID 1, RAID 4, and RAID 5) support continued
operation with the loss of one disk unit in the device because each
data block can be reconstructed from the remaining disks, either from
a mirror copy or from parity.
The nr-spare-disks directive in /etc/raidtab indicates how many
spare disks are available to the RAID device. A device directive
is specified for each spare disk. A spare-disk directive follows
the device directive instead of the raid-disk directive. The following
excerpt from /etc/raidtab defines the previous example RAID device
with one hot spare disk available:
raiddev /dev/md0
raid-level 1
nr-raid-disks 2
persistent-superblock 1
nr-spare-disks 1
device /dev/hda1
raid-disk 0
device /dev/hdc1
raid-disk 1
device /dev/hde1
spare-disk 0
The raidhotadd utility will add a hot spare disk to a RAID
device that is running. raidhotadd will not modify /etc/raidtab.
If a spare disk is added using raidhotadd, the systems administrator
must add it to the appropriate RAID device specification in /etc/raidtab
before the system reboots:
$ raidhotadd /dev/md0 /dev/hde1
The resynchronization of a hot spare device can be monitored via /proc/mdstat.
Monitoring Software RAID Devices
The Linux software RAID implementation reports the status of all
RAID devices in /proc/mdstat, a pseudo-file that can be read by any
Linux utility that manipulates text files (e.g., more, grep).
When all RAID devices are started and
operating correctly, the following output would be seen:
$ cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hda1[0] hdc1[1] hde1[2]
97152960 blocks [2/2] [UU]
For each RAID device, the output shows the device name, its status
(active), the RAID mode (raid1), the partitions that comprise the
device with their raid-disk positions in brackets, the size of the
device in blocks, and a status code letter for each active partition.
A status code of U indicates that the partition is operational. A
status code of _ (underscore) indicates that a partition has had a
disk fault. When a partition has faulted, it is removed from the list
of active partitions.
The following output would be seen for the RAID device when /dev/hdc1
is operational:
$ cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hda1[0] hdc1[1] hde1[2]
97152960 blocks [2/2] [UU]
The following output would be seen for the same RAID device when /dev/hdc1
faulted:
$ cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 hda1[0] hde1[2]
97152960 blocks [2/1] [U_]
The following output would be seen for the same RAID device while the
hot spare /dev/hde1 is being resynchronized with the data from /dev/hda1:
$ cat /proc/mdstat
md0 : active raid1 hda1[0] hde1[2]
97152960 blocks [2/1] [U_]
[========>............] recovery = 43.9% (42713908/97152960) finish=82.4min speed=11002K/sec
The following example script monitors /proc/mdstat for indication
of a RAID device failure. If a RAID device fails, a message is logged
via syslog and an email message is sent to the on-call pager. The
script can be run periodically via cron or modified to run continuously
as a daemon on system startup:
#!/bin/bash
# mdcheck: report software RAID device failures via syslog and email.

ADMIN="page-oncall@mydomain.com"
HOSTNAME=$(/bin/hostname)

# An underscore inside the status brackets of /proc/mdstat marks a failed member.
if egrep "\[.*_.*\]" /proc/mdstat > /dev/null
then
    logger -p daemon.error \
        "mdcheck: Failure of one or more software RAID devices"
    echo "Failure of one or more software RAID devices on ${HOSTNAME}" | \
        /bin/mail -s "$0: Software RAID device failure on ${HOSTNAME}" ${ADMIN}
fi
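To run the check periodically, a crontab entry along the following
lines could be used (the installed path /usr/local/sbin/mdcheck is
only an example):
# Check software RAID status every ten minutes.
0,10,20,30,40,50 * * * * /usr/local/sbin/mdcheck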
Recovering from a Failed Disk Drive
Eventually, a disk drive comprising a software RAID device will
fail and require replacement. The failed disk can be determined
from the output of /proc/mdstat, and the failed unit must be removed
and replaced with a working disk unit. It is important to replace
the disk with one that is physically identical to the failed disk,
or at least of equal or greater capacity, and the replacement disk
must be partitioned identically to the failed disk. The raidhotadd
command can then be issued to
tell the RAID device drivers to activate the replaced disk and begin
the resynchronization of data. The progress of resynchronization
can be monitored via /proc/mdstat:
$ raidhotadd /dev/md0 /dev/hdc1
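Before the physical replacement, the failed member must be removed
from the running RAID device. A sketch of the steps leading up to the
raidhotadd shown above (device names are illustrative; raidsetfaulty
is only needed if the kernel has not already marked the member as
failed, and is included in recent raidtools releases):
# Mark the member as failed if the kernel has not already done so.
$ raidsetfaulty /dev/md0 /dev/hdc1
# Remove the failed member so the drive can be physically replaced.
$ raidhotremove /dev/md0 /dev/hdc1
After the new drive has been installed and partitioned with the 0xFD
partition type (for example, by cloning the layout from the surviving
mirror with sfdisk as shown earlier), raidhotadd returns it to the
RAID device, and the rebuild can be followed in /proc/mdstat.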
Conclusion
Linux software RAID provides systems administrators with the means
to implement the reliability and performance of RAID without the
cost of hardware RAID devices. The kernel supports all basic RAID
modes, and complex layouts can be created by using RAID devices as
the building blocks of other RAID devices. The examples presented were simple RAID 1
and RAID 5 configurations. The judicious allocation of hot spare
devices to RAID units will reduce the risk of performance degradation
when a disk device fails. The best results will be achieved by planning
the desired configuration in a diagram before modifying system configurations.
Any changes should be thoroughly tested before implementing them
on a production system.
References
Linux Software RAID HOWTO -- http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html
MDADM MD device management package -- http://www.cse.unsw.edu.au/~neilb/source/mdadm/
Acknowledgements
Ryan would like to thank Rob Jenson from Spotch Consulting for
editing the article. He would also like to thank the kernel developers,
and the architects and programmers responsible for the MD device
driver code.
Ryan Matteson has been a Unix systems administrator for eight
years. He specializes in emergent and Web technologies, Storage
Area Networks, high-availability systems, Linux, and Solaris operating
systems. Questions and comments about this article can be addressed
to matty91@bellsouth.net.