
Recursive Use of SystemImager for Cloning Entire Clusters

Scott Delinger

The SystemImager (SI) utility for automated Linux installations is invaluable in the construction of clusters (Figure 1 shows a typical cluster configuration). The SystemImager Web site discusses many of the uses for SI, including server farms, clusters, and desktop environment rollouts. SystemImager is based on a client-server model: the boxes to be imaged are the clients, and the image repository is the master. SI also allows administrators to perform safe upgrades with revision control by saving different images; rolling back to a "last known good" image can return production boxes to service very quickly.

Background

I have had the good fortune to receive too much cluster-building work in the past two years (Table 1). Our department has installed five new clusters in that time, as well as renewing a Windows computing lab that serves as a Linux cluster (named werewulf) at night. Our first 16-node cluster, beowulf, which dates from 1999, was constructed the hard way, following these steps: build up the master; build up a compute node; take the side panels off that compute node; slave a new node's hard disk to the build node; "dd" the drive image to the new node's drive; change the static IP address and hostname; repeat n times. When building the newer clusters, we needed a better way -- the two clusters built in 2001 each have 20 nodes, and the three built in 2002 have 24, 40, and 192 processors, respectively.
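
For the curious, that hard way boiled down to something like the following for each new node (a rough sketch only; device names and file locations are illustrative and assume the Red Hat-style layout we ran):

  # New node's disk is slaved into the build node as /dev/hdb
  dd if=/dev/hda of=/dev/hdb bs=1M          # raw copy of the entire drive
  mount /dev/hdb1 /mnt                      # mount the copy's root partition
  vi /mnt/etc/sysconfig/network             # set the new hostname
  vi /mnt/etc/sysconfig/network-scripts/ifcfg-eth0   # set the new static IP
  umount /mnt
  # ...move the disk into the new node, and repeat n times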

I originally looked for a package to simply clone the compute nodes, and was pleased to discover SystemImager. SI hosts the node (client) image on the master node and clones the bare nodes from this image. Better yet, SI uses DHCP in the process, so I could select from static IPs, dynamic IPs, or dynamically provided IPs locked to MAC addresses. Before using SI to build clusters, I tested it on a desktop Linux box -- changing some files, re-imaging, and generally familiarizing myself with the different aspects of the program.
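
The MAC-locked option is handled by ordinary ISC dhcpd configuration on the master. A minimal sketch, with a made-up MAC address and the sort of RFC 1918 private addressing typical of a cluster's internal network (your values will differ):

  subnet 192.168.1.0 netmask 255.255.255.0 {
      option routers 192.168.1.1;
      range 192.168.1.200 192.168.1.250;    # pool for purely dynamic leases
  }

  host node01 {
      hardware ethernet 00:50:DA:12:34:56;  # MAC address of node01's NIC
      fixed-address 192.168.1.101;          # node01 always gets this IP
  }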

Inside SystemImager

Briefly, this process involves creating a "golden client" on one of the nodes and loading the SystemImager client packages onto it. The golden client serves as a prototype for the nodes in the cluster. Then you set up a master that all of the clients can access, and load the SystemImager server packages onto that master. The image created on the golden client is then taken up to the master, and that image is made available to the other nodes, which can request it by booting from a floppy, CD, the network, or the hard disk. Figure 2 provides a schematic of the process.
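
In command form, that round trip is short. The flag spelling has shifted a bit between SI releases, but on the versions I used it looked roughly like this (host and image names are examples):

  # On the golden client: get the box ready to have its image pulled
  prepareclient

  # On the image master: pull the golden client's image into the repository
  getimage -golden-client node01 -image compute_v1.0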

You can update the image directly on the master (chrooting into the image is suggested) or make a change to a golden client and then take the revised client image up to the master. Updating the other clients only requires an update to the changed files. You do not have to install a complete new image.
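
Either path ends with "updateclient" run on each node, which rsyncs only the differences. A sketch of the chroot route follows, with an illustrative image name and path (in the pre-2.0.1 versions I used, everything lived under /tftpboot/systemimager; newer releases keep images elsewhere):

  # On the master: step inside the image and make the change
  chroot /tftpboot/systemimager/images/compute_v1.0 /bin/sh
  rpm -Uvh /tmp/patch.rpm     # hypothetical errata package, copied into the image first
  exit

  # On each client: pull down only what changed
  updateclient -server master -image compute_v1.0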

Revision control is possible by making a new image of the golden client and pulling this new image down to the other nodes. Although this requires copying a greater amount of data across the network, recovery from a mistake in the new image requires only that the "last known good" image be downloaded to the clients.

I tend to install security patches directly to a copy of the last known good image (copying images on the server is possible). I can then test the image on a node. Once that node has remained stable, I download the patched image to the balance of the nodes. When adding software packages or updating distributions, I create a new image to allow for revision control.
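
So the patch cycle amounts to copying the image, patching the copy, and trying it on one node before the rest. In sketch form (image names are examples; the path is the same illustrative one as above):

  # On the image master: clone the last known good image
  cp -a /tftpboot/systemimager/images/compute_v1.1 \
        /tftpboot/systemimager/images/compute_v1.1sec

  # Patch the copy (chroot in, apply the errata), then on ONE test node:
  updateclient -server master -image compute_v1.1sec

  # Once that node has stayed stable, run the same updateclient on the
  # balance of the nodes.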

Variations on a Theme by Finley

In 2001, the two clusters (pleiades and naegling) I needed to build were quite similar with respect to hardware (see Table 1) and software load. Since the two clusters were running similar computational chemistry packages in two different research groups, many libraries were common to both clusters. Having built up the master node of pleiades and the first compute node, I quickly (6 minutes/box!) imaged the other 19 nodes. Almost all of the construction time was consumed by planning and by installing all of the libraries and packages needed in the cluster onto the master and that first compute node.

When it came time to build the master node of naegling, sys admin laziness [1] set in. Since the master node of pleiades contained the image of its compute nodes and had nearly all of the software load needed by naegling, I thought: why not make this cluster master an SI client, make my office desktop a "super" master, and use SI recursively? (See Figure 3.) In other words, why not use SI to clone a complete cluster? The only factor requiring consideration was that the hard disks of naegling's compute nodes were 50% larger, as the master nodes had identical hardware but for the CPU. I installed the SI packages intended for the SI image server onto my desktop and the SI client packages onto pleiades's master, ran "prepareclient" on pleiades's master, and then ran "getimage" on my desktop box. The image uploaded without issue.
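
The recursion needed nothing beyond the usual commands; only the roles changed (the image name here matches the master.sh script mentioned below, and the flags are approximate as before):

  # On pleiades's master, now wearing an SI client hat:
  prepareclient

  # On my desktop, now the "super" image master:
  getimage -golden-client pleiades -image master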

Now, naegling's master was to have a static IP on the public NIC, which obviously needed to be different from pleiades's public NIC IP. The SI team had figured that out: a local.cfg file on my boot floppy would be used to push the eth0 networking info down to naegling. The networking information for the compute nodes could match that of pleiades because it used an RFC 1918 private address space. I made a link in the images directory for naegling.sh to point to master.sh and awaited the results. The image of pleiades's master was smoothly downloaded to naegling's master, and after many minutes, the familiar "I've been done for x seconds, reboot me" beeps were heard. I rebooted, and naegling's master node sat awaiting a login. Furthermore, naegling's master node had the SI master node packages installed, as well as the images for the compute nodes, so the next step in the recursion was nearly ready.
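
For reference, the local.cfg on that floppy is nothing more than a handful of shell-style variable assignments; mine was along these lines, with placeholder addresses substituted here for our real ones:

  HOSTNAME=naegling
  DOMAINNAME=chem.example.ca        # placeholder domain
  DEVICE=eth0
  IPADDR=192.0.2.10                 # placeholder public IP for eth0
  NETMASK=255.255.255.0
  GATEWAY=192.0.2.1
  IMAGESERVER=192.0.2.5             # my desktop, the "super" master

  # And, in the image server's images directory, the link that maps
  # naegling's hostname onto the master image's install script:
  ln -s master.sh naegling.sh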

To address the hard disk size difference on the compute nodes, I noted that the reason 30-GB disks were purchased for naegling was a desire for additional scratch space on the compute nodes. Therefore, I could leave all partitions but /scr identical to those on pleiades. The script compute_v1.1.sh in /tftpboot/systemimager had a stanza concerning disk partitioning, and all that was required was to increase the /scr size to take up the balance of the disk on naegling's compute nodes. I used a boot floppy created for imaging pleiades's compute nodes, and all was well -- the node, when rebooted, had a /scr partition of 22 GB rather than the 13 GB on pleiades. I then installed the few packages needed on naegling that were not needed on pleiades, pulled a new image up to the master, imaged three more nodes, asked the researchers to test the new packages, and then imaged the remaining nodes.
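
That change, and the check afterward, in sketch form (the exact syntax of the partitioning stanza varies with the SI release, so the comment describes the intent rather than quoting it):

  # On naegling's master: in the inherited autoinstall script, grow the
  # /scr entry of the partition stanza to take the balance of the 30-GB disk
  vi /tftpboot/systemimager/compute_v1.1.sh

  # After imaging and rebooting the first compute node, confirm it took:
  df -h /scr        # ~22 GB here, versus ~13 GB on pleiades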

Variations Da Capo

In 2002, three grants written for the purchase of clusters were funded at the same time. To save work (laziness again), as well as to leverage volume purchasing power, all of the nodes for the three clusters (oh, hrothgar, and plethora) were purchased together with nearly identical specifications: 126 dual AMD Athlon MP 1800+ (TM) compute nodes with 40-GB hard disks and 3Com 3c920 NICs. Master nodes were similar, except for disk size. One wrinkle in this hardware configuration was the 3Com (TM) NICs; an updated kernel with the proper version of the NIC driver on the SI boot floppy solved the "I can't receive DHCP leases" problem. Another wrinkle was my use of ext3 filesystems for these newer clusters: fsck'ing 22 GB of temp files when a box crashes mid-calculation is no one's idea of a good time. The symptom was that the nude nodes being imaged wouldn't image any partition other than the root. Editing "updateclient" to allow ext3- as well as ext2-mounted partitions to be imaged solved that.

Now, past experience said to start with plethora's master and grow /home from 18 GB up to the 36-GB disks on oh and hrothgar. However, because plethora's owner already had two clusters and could wait, I needed to build the similarly equipped oh and hrothgar before building plethora. So, in this variation, I needed to shrink a partition. Also, SystemImager had been improved to version 2.0.1 since my last use, which moved some of the directories and contents around (/tftpboot is no longer used in my application) and included System Configurator, which handles newer distributions more easily than what had been hard-coded into SI in the past. As before, as I built up the client images, I "sprayed" the current image out to four or so nodes, tested, got feedback from the users, and then made changes and re-imaged. Very slick. Anticipating larger and more numerous images, we purchased a proper image master with sufficient disk space for saving images for all of the clusters, relieving my desktop box of this task.

I insisted on installing as much as possible of the software that belonged on any one of these three clusters onto all of them. In some cases, certain packages would never get touched on a given cluster; on the other hand, building time would be reduced to cloning time. Users were getting frustrated waiting for their cluster while I tested packages. However, when I finished oh, I cloned hrothgar from oh in a single day. Users admitted that doubling their available CPUs in a day was worth the wait.

Plethora was (and is) a bit of a challenge. The difficulty isn't in the cloning; that's straightforward enough. Shrinking the /home partition was trivial -- I edited the stanza regarding partition sizes as before, and since the material in the /home/* directories required much less than 18 GB, no problems arose. Keeping almost 0.5 THz of computational resources cool has been the challenge. The 20 or so nodes of plethora that are currently running work fine, and I've taught one of the faculty how to add a package to the "golden client", take an incremental image, and pass it on to the other nodes. If something gets fouled, we just image back to the previous working image and we're up and running again.

D.S. al Fine: Werewulf

Finally, also during the summer of 2002, we renewed the hardware in a computing lab. In 1997, the lab was outfitted with 40 Pentium II/233 stations with 64 MB RAM, 3-GB hard disks, and 17" monitors. During the day, the lab ran Windows 95(TM) on all stations, and undergraduate students used the stations for email, browsing, and report writing. At 17:00, the doors locked and the lights went out. Fifteen minutes later, the stations rebooted into Linux and ran as a cluster until 08:00, at which point they reverted to Windows. This facility was well received as an overnight cluster, and later the Linux image was used on a few stations during the day to provide X-terminal capabilities for NMR data-processing software that runs on Solaris.

This summer, we purchased 40 AMD Duron(TM) 1.1-GHz boxes with 256 MB SDRAM and 20-GB IDE hard disks. We needed to keep the dual-use nature of the laboratory. The Windows image (handled with Symantec's Ghost(TM)) was little changed once we made the proper alterations for the new hardware. The Linux image created in 1999, however, was quite dated. To quickly renew the Linux side of the lab, I imaged a new master (werewulf) from the image I had for the oh/hrothgar/plethora clusters, then attempted to image a compute node from the new master. Since we'd used SMC 9432 NICs in the lab in 1997 and had DHCP-served static IPs (locked to MAC addresses), we migrated those cards into the new boxes. Under the time constraints we faced, we weren't able to use SI for imaging the 40 new Durons, because the EPIC100 driver wasn't in the standard 2.2.18 kernel provided with SI. Given another day or two before classes started, we could have made it. The new version of SI being developed will use modular NIC drivers, removing one of the last hindrances I've encountered with SI. Overall, recursive use of SI has saved me weeks of time in a large project, as well as offering software revision control.

1 "Laziness is a very important system administrative virtue." from Essential System Administration, 2nd Ed. by Æleen Frisch. O'Reilly & Associates, p. 342.

Scott Delinger is a lapsed Ph.D. analytical chemist who now serves as the IT Administrator for the Department of Chemistry at the University of Alberta. He maintains seven supercomputational clusters in one of the largest cluster facilities in Canada. He spends time off-line with his wife and daughter. He can be emailed at: scott.delinger@ualberta.ca.