Solstice DiskSuite Soft Partitions -- Not Hard at All

Matthew Cheek

The Sun Solaris Operating Environment (OE) is a mature implementation of UNIX that runs on both Sun SPARC and Intel x86 architecture hardware. The native file system of Solaris is the venerable UNIX File System (UFS) and, except for minor changes (e.g., UFS logging) over the years, it has not really kept pace with the explosive growth of storage technology, in terms of both capacity and performance. Fortunately, Sun has always offered DiskSuite [1], an add-on logical volume manager for the Solaris OE that provides fault-tolerant features such as concatenation, striping, mirroring, RAID5, and logging. However, both UFS and DiskSuite have always been constrained by the physical disk partition table, effectively limiting the number of file systems per disk to eight. This was not much of a problem when disks were relatively small (i.e., less than two gigabytes). As individual disk capacities continued to increase, the eight-slice limit became more of an issue, especially from a space management standpoint.

This article will introduce DiskSuite Soft Partitions, a new feature that shatters the eight-slice barrier and greatly extends the capabilities of the DiskSuite storage management product.

DiskSuite Overview

DiskSuite is a layered product that enables administrators to manage a system's disks through virtual disk devices called metadevices. From the point of view of a user or application, a DiskSuite metadevice is seen and accessed exactly like a physical disk device. This is accomplished by the DiskSuite metadisk device driver, which coordinates I/O to and from physical disk devices and metadevices.

While the most basic metadevice is simply a single disk partition, such a metadevice is not very useful. The value of DiskSuite is only realized when complex metadevices are used to increase data availability or storage capacity.

DiskSuite permits the creation of metadevices that consist of multiple physical disk partitions. This allows the administrator to build file systems that are larger than any single disk device by distributing the data across multiple physical disk slices. Such a complex metadevice can be either a concatenation, which arranges the data "end-to-end", or a stripe, which alternates chunks of data across disk slices. A concatenation is typically used to "grow" an existing metadevice by attaching additional slice(s) to the end. A striped metadevice provides better performance by spreading I/O operations across multiple physical disks. Here is the command to create a concatenated metadevice:

# metainit d10 3 1 c1t0d0s2 1 c1t1d0s2 1 c1t2d0s2
d10: Concat/Stripe is setup
The concatenation metadevice created in this example consists of three "stripes" (the number 3), each made up of a single slice (the number 1 preceding each slice). In this example, each of the three slices is actually an entire disk (slice 2 conventionally spans the whole disk). DiskSuite then displays a confirmation message that the metadevice was successfully set up.
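
A concatenation can later be grown by attaching another slice to the end with metattach. Here is a sketch, using a hypothetical fourth disk (c1t3d0):

# metattach d10 c1t3d0s2
d10: component is attached

Any file system on the metadevice can then be expanded into the new space with growfs, which is demonstrated later in this article.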

Similarly, a stripe metadevice is created with this command:

# metainit d20 1 3 c1t0d0s2 c1t1d0s2 c1t2d0s2
d20: Concat/Stripe is setup
This striped metadevice, d20, is made up of a single stripe (the number 1) spread across three slices. Since an interlace value was not specified with the "-i" switch, the default size of each data segment per slice is 16 Kbytes. DiskSuite concatenations and stripes are also known as RAID level 0 [2].
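
If the default interlace does not suit your workload, a different value can be supplied at creation time with the "-i" switch. A sketch (the 32-Kbyte value is purely illustrative):

# metainit d21 1 3 c1t0d0s2 c1t1d0s2 c1t2d0s2 -i 32k
d21: Concat/Stripe is setup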

Mirroring

Be aware that neither a concatenation nor a stripe by itself provides any level of fault tolerance. The failure of a single physical device in a concatenated or striped metadevice results in the loss of the entire metadevice. DiskSuite's answer to this is the mirror metadevice. DiskSuite mirroring provides data integrity by writing data to two or more disk devices. A DiskSuite mirror metadevice is made of one or more other metadevices referred to as submirrors.

For example, the following commands will create a two-way mirror metadevice (d1), which comprises two simple submirror metadevices (d2 and d3), each corresponding to a physical disk partition:

# metainit d2 1 1 c0t1d0s2
d2: Concat/Stripe is setup
# metainit d3 1 1 c0t2d0s2
d3: Concat/Stripe is setup
# metainit d1 -m d2
d1: Mirror is setup
# metattach d1 d3
d1: Submirror d3 is attached
The resulting d1 mirror metadevice can then be used just as if it were a physical disk slice. However, every write request to this mirror metadevice will actually be written to each of the two submirror metadevices (d2 and d3) and, hence, to each of the two physical disk partitions. This mirroring (also known as RAID level 1) provides redundant copies of your data. Loss of either underlying physical disk partition will not impact the operation of the mirror metadevice, and the repair or replacement of the failed device can be scheduled at a convenient opportunity.
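
When a disk behind one of the submirrors does fail, the affected component can be re-enabled with metareplace once the hardware has been repaired, which triggers a resync from the healthy submirror. A sketch, assuming the replacement disk occupies the same c0t2d0s2 slice:

# metareplace -e d1 c0t2d0s2
d1: device c0t2d0s2 is enabled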

Typically, the desired attributes of performance and capacity along with availability are achieved through the use of mirror metadevices composed of striped or concatenated submirror metadevices. This results in logical volumes that are immune to at least single disk failures, and sometimes to multiple simultaneous disk failures, as long as at least one complete copy of every piece of data remains intact. DiskSuite striped mirrors are also known as RAID level 1+0, or simply 10.
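
For example, the following sketch (with hypothetical devices on two controllers) builds a two-way mirror of three-disk stripes using the same commands shown earlier:

# metainit d31 1 3 c1t0d0s2 c1t1d0s2 c1t2d0s2
d31: Concat/Stripe is setup
# metainit d32 1 3 c2t0d0s2 c2t1d0s2 c2t2d0s2
d32: Concat/Stripe is setup
# metainit d30 -m d31
d30: Mirror is setup
# metattach d30 d32
d30: Submirror d32 is attached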

RAID5

While mirrored metadevices are the most fault tolerant, the physical disk requirements are exactly double those of a simple metadevice. If you are unwilling or unable to dedicate twice as many disks to meet your space requirements but still wish to have data redundancy, DiskSuite provides the RAID5 metadevice. As the name implies, a DiskSuite RAID5 metadevice implements RAID level 5, which stripes both data and parity across all the physical disk partitions in the metadevice.

If a physical disk fails, the data on the failed disk can be rebuilt from the data and parity information distributed across the remaining disks. The parity information consumes one slice's worth of space, but is spread across all the slices in the RAID5 metadevice rather than confined to a dedicated parity disk. This means that the actual data capacity of a RAID5 metadevice is n-1 times the size of each slice, where n equals the number of slices in the metadevice. For instance, a RAID5 metadevice composed of five 18-gigabyte slices results in approximately 72 gigabytes of usable space (4 x 18). An example command to create a DiskSuite RAID5 metadevice is:

# metainit d100 -r c2t0d0s2 c2t1d0s2 c2t2d0s2 c2t3d0s2 c2t4d0s2
d100: RAID is setup
The RAID5 metadevice d100 is created from five slices. DiskSuite confirms that the RAID5 metadevice is set up and begins initializing the metadevice. Initialization is the process of "zeroing out" all the disk blocks in all the slices. The time to complete this initialization depends on the size of the RAID5 metadevice and the speed of the disks. Until the initialization process is complete, the RAID5 metadevice is unavailable for use.
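
Progress can be monitored with the metastat command (covered in more detail below); while initialization is underway, the metadevice's state is reported along these lines (a sketch; the exact wording may vary by DiskSuite version):

# metastat d100 | grep State
    State: Initializing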

A DiskSuite RAID5 metadevice can suffer the loss of only a single member slice and continue operating, albeit in a degraded mode, as the missing data is reconstructed from the parity information on the remaining slices. It is important to replace a failed disk in a RAID5 metadevice as soon as possible because the loss of a second disk while the metadevice is still degraded will result in the loss of the entire metadevice.
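
As with mirrors, the metareplace command re-enables a replaced disk, causing its contents to be reconstructed from the parity information on the surviving slices. A sketch, assuming slice c2t2d0s2 failed and its disk was replaced:

# metareplace -e d100 c2t2d0s2
d100: device c2t2d0s2 is enabled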

Creating a File System

In most cases, the next step after configuring DiskSuite metadevices (concatenations, stripes, mirrors, or RAID5) is to create a UFS file system on the metadevice. Once this is complete, the file system can be mounted and made available for use. For instance, the following commands create and mount a file system on the example d100 RAID5 metadevice from above:

# newfs /dev/md/dsk/d100
newfs: construct a new file system /dev/md/rdsk/d100: (y/n)? y
/dev/md/rdsk/d100: 141449528 sectors in 30019 cylinders of 19 tracks, 248 sectors
        69067.2MB in 1365 cyl groups (22 c/g, 50.62MB/g, 6208 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 103952, 207872, 311792, 415712, 519632, 623552, 727472, 831392, 935312, 1039232, 1143152, 1247072, 1350992,
 1454912, 1558832, 1662752, 1766672, 1870592, 1974512, 2078432, 2182352, 2286272, 2390192, 2494112, 2598032,
.
.
.
# mount /dev/md/dsk/d100 /mnt
# df -k /mnt
Filesystem        kbytes    used  avail capacity  Mounted on
/dev/md/dsk/d100  70174288 562468 67557168    1%   /mnt
Please note that this section is meant only as a brief conceptual overview of DiskSuite's concatenation, stripe, mirror, and RAID5 metadevices and how they are used to increase data performance, capacity, and availability. Also note that the details of configuring and maintaining the DiskSuite metadevice state database replicas, DiskSuite Trans metadevices for logging, or DiskSuite hotspares are not covered at all. Refer to the DiskSuite Installation and Reference Guides for details on implementing DiskSuite.

DiskSuite Soft Partitions

Introduced in DiskSuite v4.2.1 for Solaris 8 (and subsequently backported to DiskSuite v4.2 for Solaris 7 and 2.6), DiskSuite Soft Partitions are a new feature that provides much greater flexibility in managing disk resources. Simply put, with Soft Partitions, administrators can logically subdivide a physical disk or a DiskSuite concatenation, stripe, mirror, or RAID5 metadevice into more than eight partitions. This is especially important as disks become larger and disk arrays present even larger logical devices to the system. A simple example will demonstrate the value of Soft Partitions.

Let's revisit our d100 RAID5 metadevice from the previous section. After creating and initializing the metadevice, the next step is to create a single UFS file system. However, what if a single 72-gigabyte file system is just too large? Before the introduction of Soft Partitions, there was no other option. Now, rather than a single huge file system, we can create smaller soft partitions on the RAID5 metadevice and use them independently. For example, we could create two soft partitions on top of that 72-gigabyte d100 RAID5 metadevice:

# metainit d101 -p d100 1g
d101: Soft Partition is setup
# metainit d102 -p d100 2g
d102: Soft Partition is setup
The previous two commands create two soft partitions on top of d100: one that is 1 gigabyte in size, and another 2 gigabytes in size. (The last parameter of these metainit commands is the desired size of the soft partition and can be specified in K or k for kilobytes, M or m for megabytes, G or g for gigabytes, T or t for terabytes (one terabyte is the current maximum soft partition size), or B or b for blocks.) Now we can create file systems on these soft partition metadevices and mount them for use:

# newfs /dev/md/rdsk/d101
<newfs output suppressed for brevity>
# mount /dev/md/dsk/d101 /mnt1

# newfs /dev/md/rdsk/d102
<newfs output suppressed for brevity>
# mount /dev/md/dsk/d102 /mnt2

# df -k /mnt1 /mnt2
Filesystem        kbytes used  avail capacity Mounted on
/dev/md/dsk/d101  983599  9   924575    1%    /mnt1
/dev/md/dsk/d102 2030887  9  1847086    1%    /mnt2
Each of these two newly created soft partition metadevices uses only a portion of the d100 metadevice; the remaining 69 gigabytes are available for creating other soft partitions and/or increasing the size of existing soft partitions. For example, if it becomes necessary to enlarge the 1-gigabyte file system to 5 gigabytes, the following commands will accomplish that, all without unmounting the file system:

# metattach d101 4g
d101: Soft Partition has been grown
# growfs -M /mnt1 /dev/md/rdsk/d101
<growfs output suppressed for brevity>

# df -k /mnt1
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d101     4921773       9 4862749     1%    /mnt1
Besides creating soft partitions on pre-existing metadevices, DiskSuite permits the creation of soft partitions directly on physical disk partitions or even entire disks. This last capability is useful when a disk array presents logical volumes to the system that are actually multiple-disk logical units (LUNs). For instance, the Sun StorEdge RAID Manager is used to group a set of physical drives in a disk array into a LUN that is either RAID level 0, 1, 3, or 5. This LUN is then seen by Solaris as one drive. Typically, administrators want to group the physical drives in an array into large, manageable sets. As a result, these LUNs can be tens or hundreds (or more) of gigabytes in size and, prior to the availability of DiskSuite soft partitions, administrators were limited to no more than eight file systems per LUN. The following example demonstrates the creation of multiple soft partitions directly on a physical drive (c3t0d0):

# metainit d110 -p -e c3t0d0 10g
d110: Soft Partition is setup
# metainit d111 -p c3t0d0s0 5g
d111: Soft Partition is setup
# metainit d112 -p c3t0d0s0 15g
d112: Soft Partition is setup
The "-e" switch in the first metainit command indicates that the entire disk (as specified by "c*t*d*") should be repartitioned and reserved for soft partitions. This option can only be used the first time a soft partition is placed on an entire disk. Keep in mind that the specified disk will be repartitioned such that slice 7 is reserved for metadevice state database replica(s) and slice 0 contains the remaining space. Slice 7 will be a minimum of two megabytes in size, but could be larger depending on the particular characteristics of the disk. Be aware that the previous partition layout will be overwritten. The initial soft partition (d110 in the example above) will be placed on slice 0. Subsequent soft partitions (e.g., d111, d112) are also placed on slice 0 by omitting the "-e" switch and specifying the disk component as c*t*d*s0. DiskSuite manages the layout of soft partitions on slice 0.

The metastat command is used to display the status of all metadevices. Here is the output showing the previous three soft partitions built on a single drive:

# metastat
d110: Soft Partition
    Component: c3t0d0s0
    State: Okay
    Size: 20971520 blocks
        Extent           Start Block           Block count
             0                     1              20971520

d111: Soft Partition
    Component: c3t0d0s0
    State: Okay
    Size: 10485760 blocks
        Extent           Start Block           Block count
             0              20971522              10485760

d112: Soft Partition
    Component: c3t0d0s0
    State: Okay
    Size: 31457280 blocks
        Extent           Start Block           Block count
             0              31457283              31457280
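
Should a soft partition no longer be needed, unmount any file system on it and remove it with metaclear, which returns its extents to the pool of available space. A sketch:

# metaclear d112
d112: Soft Partition is cleared
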
Upgrading DiskSuite

Soft Partitions are only available on DiskSuite v4.2.1 on Solaris 8 and DiskSuite v4.2 on Solaris 7 and 2.6. Additionally, the Soft Partitions feature was added to the base DiskSuite product via a product patch. In other words, although you may be running DiskSuite v4.2.1 or v4.2, you may or may not have soft partitions. The simplest way to determine whether your DiskSuite product is soft partition-aware is to run the metainit command with no options. If the usage message shows the "-p" option, you have soft partitions:

# metainit
usage:  metainit [-s setname] [-n] [-f] concat/stripe numstripes
                width component... [-i interlace]
                [width component... [-i interlace]] [-h hotspare_pool]
        metainit [-s setname] [-n] [-f] mirror -m submirror...
                [read_options] [write_options] [pass_num]
        metainit [-s setname] [-n] [-f] RAID -r component...
                [-i interlace] [-h hotspare_pool]
                [-k] [-o original_column_count]
        metainit [-s setname] [-n] [-f] trans -t master [log]
        metainit [-s setname] [-n] [-f] hotspare_pool [hotspare...]
        metainit [-s setname] [-n] [-f] softpart -p [-e] device size
        metainit [-s setname] [-n] [-f] md.tab_entry
        metainit [-s setname] [-n] [-f] -a
        metainit -r
If you do not have soft partitions, but are running one of the specified versions of DiskSuite on a supported version of Solaris, all that is necessary to add soft partition support is to download and apply the latest DiskSuite Product Patch from SunSolve Online (http://sunsolve.sun.com) and reboot. The minimum required patch versions are shown in Table 1. While these are the minimum DiskSuite Product Patch versions to add soft partitions, I strongly recommend that the very latest version of these important patches be applied. This Product Patch includes not only bug fixes but also enhancements to the DiskSuite product.
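
Once applied, the patch's presence can be confirmed with showrev. A sketch; substitute the actual patch ID for your DiskSuite version from Table 1:

# showrev -p | grep <patch-id>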

Solaris Volume Manager

Starting with Solaris 9, DiskSuite is renamed the "Solaris Volume Manager" or "SVM" and has been completely integrated into the Solaris 9 OE. SVM is now the standard logical volume manager and the name change is simply a reflection of that new emphasis. There are no significant core differences between Solstice DiskSuite v4.2.1 on Solaris 8 and the Solaris Volume Manager on Solaris 9. For all intents and purposes, SVM IS Solstice DiskSuite v4.2.1 and, as such, includes the Soft Partition feature.

Summary

Soft Partitions free the Solaris administrator from having to think of storage only in terms of disk slices. Simply configure physical disk devices into large DiskSuite metadevice storage "pools" and use DiskSuite Soft Partitions to carve the pools up as needed.

I hope that this introduction to DiskSuite Soft Partitions will cause those Solaris systems administrators already using DiskSuite in their environments to revisit their configurations and possibly implement soft partitions. In addition, I encourage Solaris systems administrators not currently using DiskSuite to reconsider its place in their environments.

Matthew Cheek is a senior UNIX systems administrator with experience in the telecommunications, healthcare, and manufacturing industries and has installed, configured, managed, and written about UNIX systems since 1988. He is the lead author of Tru64 UNIX System Administrator's Guide from Digital Press. Matt is currently at Medical Archival Systems, Inc. and can be contacted at: cheek@mars-systems.com.

[1] DiskSuite was originally named "Online: DiskSuite" and was compatible with both Solaris 1 (i.e., SunOS 4.1.x) and Solaris 2. Beginning with Solaris 2.4, it was renamed "Solstice DiskSuite" and it continued to be called that until, with the release of Solaris 9, it became the "Solaris Volume Manager". In this article, I adopt the common shorthand of calling it simply "DiskSuite".

[2] RAID is an acronym for Redundant Array of Inexpensive (or Independent) Disks. There are seven RAID levels, 0-6, each referring to a strategy of distributing data across disks while ensuring data redundancy. DiskSuite supports RAID level 0 (concatenations and stripes), RAID level 1 (mirrors), RAID level 5 (striping with parity information for redundancy), and RAID level 1+0 (striped mirrors). Use your favorite search engine to find resources describing the various RAID levels.