
Do-It-Yourself Clusters

Denis Sunko

Our research group recently bought a new, fourth node for our tiny cluster. I added the following stanza to /etc/dhcpd.conf on the master node (see Listing 1):

host osa4 {
        hardware ethernet 00:c3:d4:e5:a1:b2;
        fixed-address osa4;
}
("Osa" means wasp in Croatian.) Then we connected the new box with three wires: power supply, patch cable to the Gigabit Ethernet switch, and null-modem cable from its first serial port to the second serial port of the third box. We turned on the power supply switch and were done, in principle. But I could not resist making a show; I had connected by ssh to the third box and pointed minicom at ttyS1, so we could watch the new node boot from the BIOS up. Of course, I knew it would go smoothly, because everything was the same as with the other boxes before it. Only a misprint in the hardware address above could have spoiled the fun.

How did we make this happen? Lots of leisurely reading, stretched over two months, and not much work, concentrated into several afternoons. This article is a compression of that reading along with a little after-the-fact wisdom. The configuration I will describe in this article is so simple, basic, and useful that I hereby boldly proclaim -- it should be everybody's first cluster configuration.

Preventive Troubleshooting

You can do all the essential troubleshooting before even starting to shop for equipment. To try out the configuration described in this article, you'll need a Linux PC with an Ethernet port, a crossover cable, and a laptop. The laptop does not need to have Linux installed, because it will become a stateless slave node of the PC. It does, however, need the ability to boot over the network; you can check for this in the BIOS setup or the documentation.

The magic element is PXE, short for "Preboot Execution Environment". It is a public Intel specification and the de facto standard today. If your laptop does not support it, the Ethernet card in your PC might; however, you probably have an "unbootable" one, because those are cheaper. In that case, for the PC to be booted over the network, the laptop must of course be running Linux. Given the choice, it is better to configure the laptop as the master, because it can then be used as a portable test bed at the vendor's, before you accept delivery of the slave nodes.

PXE capability is required for the nodes in the cluster. Another important feature is the ability to export the console over the serial port. Both the bootloader and the kernel can do this, so it is a major disadvantage if the BIOS cannot and must instead be reached through a video console. Find the manuals for the boards you are considering on the manufacturers' sites, and read the sections on the BIOS and the Ethernet controller, if present.

Some preventive troubleshooting is necessary with the supplier, too. First, what if the BIOS does not perform as advertised? Ours has a bug that makes it unable to boot over the network unless a video card is present. The cheapest video card I found was 2% of the price of a single node; if we were buying 100 nodes, we would rate two for free, just on account of that bug. Second, who will configure the BIOS as you need it? Both the console redirect and network boot features are typically off by default, so you need a video card during the first configuration, even if it will not stay there. For a large number of boxes, that is the dreariest part of the entire project, so the willingness to do it may well influence the choice of supplier.

Finally, where will you put your cluster? Even a small one needs a place out of the way, safe and cool, with a spare network connection nearby. Floor space that meets these requirements tends to be restricted. Take measurements to find out the maximum footprint your new toy can have. Make sure you can count on that space -- maybe someone else was there with a tape rule already? Do not underestimate the physical aspects of getting it there either. Power supplies are heavy, and lots of boxes with wires connecting them can be quite unruly, especially in a cramped space. Some kind of rack is needed, possibly on wheels. It should ideally arrive with the boxes, not three weeks later (like ours did).

Non-Invasive Setup

Suppose you want your trusty workstation to become the new cluster's head node. A non-invasive setup is one in which you can unplug the extra boxes and have your workstation back, with no harm done. The point is not so much practical as it is a way to focus on the essentials. Gauging your needs will become much more realistic once you have a simple setup up and running. If you have to reconfigure something, it is better not to be weighed down by a huge previous investment in unnecessary work.

Of course, you do not want to be locked into a configuration that cannot grow with your needs. Extra nodes and disks should be easily added one by one, with no disruption of the existing configuration. This can be achieved with some forethought in configuring the head node. The user area should be managed with the logical volume manager (LVM), which is a standard set of kernel drivers and userspace tools. It allows any number of physical disks to be presented to the system as a single logical volume, which can grow and shrink as usage requires. This is an advantage even for a single workstation. The file system on top of the LVM volume should be one designed for high performance, such as ReiserFS.
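To see why LVM makes later growth painless, here is roughly what adding a disk looks like; this is a sketch only, with hypothetical device names, volume names, and sizes, and the resize tool depends on the file system you chose:

# initial setup: one disk in a volume group, one logical volume for /home
pvcreate /dev/sdb1
vgcreate vg_home /dev/sdb1
lvcreate -L 100G -n home vg_home
mkfs.reiserfs /dev/vg_home/home

# later: a second disk arrives; grow the same volume in place
pvcreate /dev/sdc1
vgextend vg_home /dev/sdc1
lvextend -L +100G /dev/vg_home/home
resize_reiserfs /dev/vg_home/home    # grow the file system to fill the volume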

I shall mention "clustering" file systems later, but it is against the non-invasive philosophy to implement them before having used that first cluster for some time. This is especially so because the various solutions are designed with different usage patterns in mind. For now, think simply in terms of extending the already working head node with a few extra mainboards.

Nuts and Bolts

I will walk through a boot sequence in our cluster to show what happened when that fourth box was turned on. To begin, the PXE client built into the onboard Ethernet controller sent a request on the network, identifying itself by its hardware address. This was recognized by the DHCP server on the head node, because it had the hardware address in its configuration file. (If it hadn't, it would have rejected the request and noted the address from which it came in syslog, enabling the administrator to update the configuration file.) The server was configured to respond by sending a particular file called "pxelinux.0" to the client, via the so-called trivial file transfer protocol (tftp), so the head node had to act as a tftp server as well. The client acquired the server's and its own IP address in the DHCP exchange and executed the file received by tftp.
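The relevant part of dhcpd.conf (Listing 1) therefore pairs a subnet declaration for the private network with the name of the boot file, alongside per-host stanzas like the one for osa4 above. The addresses below are hypothetical stand-ins, not our actual configuration:

subnet 192.168.1.0 netmask 255.255.255.0 {
        next-server 192.168.1.1;     # the head node doubles as the tftp server
        filename "pxelinux.0";       # boot file handed to the PXE client
}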

Now the fun starts, because "pxelinux.0" is a bootloader. It first fetches its own configuration file from the server, and that configuration tells it to request the compressed kernel image, and the (similarly compressed) root file system after it. The master and slave can use tftp again, because they know each other's IP addresses from the PXE step.

When the kernel boots, the bootloader passes it a number of options, which amount to three things. First, export the boot sequence over the serial port console, as the BIOS and the bootloader itself did. Second, don't mount a root partition at all; instead use the file system received from the server, which is already loaded into RAM. Third, start a new round of requests to the DHCP server to obtain a complete network configuration, including the host name. After a couple of seconds, the minicom monitoring the boot sequence will display a login prompt, identifying the console as ttyS0, which is the first serial port, of course. The node is alive and well, having mounted all the other partitions (/usr, /home, etc.) from the server via NFS.
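A pxelinux configuration that asks for all three of these might look roughly like the sketch below; the file lives under pxelinux.cfg/ in the tftp directory, the kernel and image names, baud rate, and ramdisk size are assumptions for illustration, and ip=dhcp requires kernel-level IP autoconfiguration to be compiled in (see Listing 8 for the real thing):

SERIAL 0 115200
DEFAULT node
LABEL node
        KERNEL vmlinuz-node
        APPEND initrd=rootfs.gz root=/dev/ram0 ramdisk_size=65536 console=ttyS0,115200 ip=dhcp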

The setup for this is very easy. Install the dhcpd, tftpd, and syslinux packages (the latter contains the bootloader). If you have the kernel and root file system for the slaves, do the obvious by the book, and you are done. Only the serial port console requires some nitpicking, described below, and even that is not necessary for a successful boot. You can connect to the node via ssh and troubleshoot the console at leisure.

To prepare the kernel, compile it on the head node, using a separate tree in the compile directory, say /usr/src/linux-client. This is fun; you get to turn everything off in the configuration! Be sure to disable modular support; you can put it in later if you really need the hassle. Put in ext2 support for the root file system, and NFSv4 (client and server) for the rest. Turn off the virtual terminals, keyboard, mouse, IDE, SCSI, and whatnot, basically keeping only networking and the driver for your Ethernet controller (CONFIG_PACKET and CONFIG_FILTER are also needed for the network boot protocol).

Keep PCI even if you have no PCI cards, since mainboard controllers are typically on the PCI bus. Don't forget to enable console on the serial port; of course, the initrd capability is also needed to put the root file system in RAM. When you are done, don't run make install by reflex (yes, this happens). Copy the kernel to the directory in which the tftp server is configured to look for it, and name it as specified in the pxelinux configuration file. You just compiled a kernel for an embedded Linux system.
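For orientation, the client kernel's .config ends up with roughly the following options set. Exact names vary between kernel series (2.6, for instance, spells the serial ones CONFIG_SERIAL_8250 and CONFIG_SERIAL_8250_CONSOLE), and the Ethernet driver shown is just a placeholder for whatever drives your onboard controller:

CONFIG_PCI=y
CONFIG_PACKET=y
CONFIG_FILTER=y              # socket filtering, for the network boot protocol
CONFIG_NET_ETHERNET=y
CONFIG_E1000=y               # or the driver for your onboard Ethernet
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_INITRD=y      # root file system in RAM
CONFIG_SERIAL=y
CONFIG_SERIAL_CONSOLE=y      # console on the serial port
CONFIG_EXT2_FS=y
CONFIG_NFS_FS=y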

The only piece of real work is to prepare the root file system for the clients. Having had a free ride so far, the temptation to hack it is almost too great -- namely, do cp -a of /lib, /bin, and friends from a live Linux system into a file formatted by mke2fs and mounted under the loopback device, unmount the image, compress it, and be done! I admit I did that the first time, and it even worked, but it is Not Good Enough. The reasons are simple: the nodes need a consistent system, and you need an easy way to administer it.

Linux distributors have developed tools for the diskless community, and it is worthwhile to learn to use them. If your head node is Debian, you need to install the diskless package on the head node, and download (NOT install) the diskless-image-simple package. Utilities from the diskless package will build a live Debian root tree for the release of your choice. The details are at:

http://wiki.debian.net/index.cgi?DiskLess
and the whole procedure is done in 15 minutes. If the directory from which you want to administer the nodes is called "./default", you will end up with the root of the tree called "./default/root". It corresponds to / on a "live" system. The beauty of this is that you can go to ./default/root, say chroot `pwd` and pretend you are on a freshly installed base system, the one your nodes will get! In fact, this is where you would stand after the first reboot on a new Debian install. You can install packages (sshd and nfs-client, in particular) and do the housekeeping with standard tools. The price paid for simplicity is that the root file system in RAM will end up bigger than strictly necessary (about 28 MB uncompressed in our case), but that is hardly an issue these days. Compressing the image of this file system is the last step in your embedded Linux project for the nodes.
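In practice, a housekeeping session looks something like the following; the package names are a guess for the Debian release of the day, so adjust to what your release actually calls the ssh server and NFS client utilities:

cd ./default/root
chroot . /bin/bash               # "/" is now the nodes' root tree
apt-get update
apt-get install ssh nfs-common   # sshd and the NFS client utilities
exit                             # back on the head node proper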

How will the users see the new nodes? Well, for the moment they can get the root prompt on them via ssh, because you forgot to set the root password for the clients. Go back to ./default/root, chroot, passwd, compress the image, reboot the nodes. Good; now the users can't use them at all, because they are not defined on the nodes. One final piece of the puzzle is still lacking, and that is called openMosix:

http://openmosix.sourceforge.net
OpenMosix is a combination of kernel patch and daemon that migrates processes to distribute workload evenly among nodes. For those who remember, it is the closest thing to a VAXcluster on Linux. The process migration is completely transparent to the users, and even to system tools. Running top on the master box of our four-node, eight-processor cluster can show, for instance, eight processes, each taking up 99% of CPU time.

In our case, openMosix is the only clustering piece of software we need. It is a stable, serious product, intended for more general configurations than the simple Beowulf I am describing (see http://www.beowulf.org for the definition). Its unique strength is that the users' cluster experience is completely painless. They don't have to write new code, link against particular libraries, or even be defined on the nodes in order to use the cluster. Of course, some may well choose to write cluster-aware programs. Since openMosix is a kernel patch, it does not get in the way of userspace cluster tools, and I can vouch for MPI in particular.

Installing openMosix is the last step in setting up the single-system image cluster we have. All the necessary software is part of standard Linux distributions and should run out of the box, provided the configuration files are set up correctly. The administration burden barely increases. To keep everything up to date, I only need to run apt-get upgrade twice: once on the head node, and once after chroot on ./default/root for the nodes. Security is no worse than it was on the original head node; if anyone should become root on the slaves, all they can do is destroy the / filesystem in RAM, easily corrected by rebooting. The "real" file systems mounted via NFS are as safe as NFS is.
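The one configuration file openMosix does need is its node map, /etc/openmosix.map, which lists, per line, a node number, the first IP address of a block, and the number of consecutive addresses in that block. A hedged sketch for a four-node private network, with hypothetical addresses, node 1 as the head, and nodes 2-4 as the slaves:

1   192.168.1.1   1
2   192.168.1.2   3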

Servicing the nodes, including BIOS configuration, can be done by ssh from anywhere, even in the middle of a calculation. The openMosix daemon will migrate all processes out of a node that is shutting down. Stability and availability are as good as those of the head node alone, meaning, as usual with Linux, months of uptime between occasional toying with the kernel. If you are running a Web farm or sending a spaceship to Mars, you probably need failover solutions, not to mention a larger cluster, but that is not the subject of an entry-level article.

Tips and Tricks

As anyone who has tried to get a new configuration working can testify, the path to glory is often strewn with quirky little difficulties, each more frustrating than the last in its sheer inanity. Because under Linux the software itself usually works, the problems are likely to be in the configuration files, which are luckily flat ASCII (no brain-dead "wizards"). The purpose of this section is to give you a guided tour around (I hope not through) the pitfalls. Some points are illustrated by taking a look at various configuration files (see Listings 1-8).

Interestingly, none of the software above has a really elegant solution for a very common setup, where the head node has two network interface cards, one for external access and one for the private cluster network; see the head node's interfaces file (Listing 2).

openMosix comes closest, with an option (setpe -p<node>) to its "setpe" utility, which explicitly points to the openMosix address used on the dual-NIC node for the cluster interface. There is a related variable MYOMID in the static configuration file, but its use is spoiled by an easily patched bug in the daemon startup script; look under "init script bug" on their bug tracker at SourceForge, if you get that far (this has recently been corrected, but probably not yet in your distribution).

The DHCP server requires all subnets in the head node's hosts file (Listing 3) to appear in the configuration file dhcpd.conf (Listing 1). For example, if your public IP address is 123.45.67.89, you need an explicit no-op line "subnet 123.45.67.0 netmask 255.255.255.0 {}" in the config file. It tells the server it won't be serving requests from the whole wide world. Maybe there is a reason it couldn't figure out as much if it didn't find that line, but I don't know it.

MPI, or at least MPICH on Debian, has no obvious (to me) way to be told not to use the head node for calculations. It automatically invokes localhost in addition to whichever nodes are listed in the configuration file "machines.LINUX". So if you have dual-processor nodes and want to balance the load equally on all, including the head, you need to list the head once and every other node twice in the configuration file.
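The machines file therefore ends up lopsided on purpose. A hedged sketch for our four dual-processor boxes follows, with the head node (here assumed to be named osa1 on the private network) listed once because MPICH adds localhost by itself, and each slave listed twice:

osa1
osa2
osa2
osa3
osa3
osa4
osa4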

A related quirk is that MPI grabs the "public" name of the head node, one that is in principle unknown to the nodes on the private internal network (where the head node has a different name), and then there are problems when the slave nodes want to talk back. This can be solved by adding the head node's public name as an alias to its private name in the clients' hosts file (Listing 4). (Because they do not know about the public network, name resolution remains unique on both master and slaves.)
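Concretely, the clients' hosts file (Listing 4) gains an alias on the head node's line. The private addresses, the head node's private name (assumed here to be osa1), and the public name osa.example.hr are all hypothetical placeholders:

127.0.0.1     localhost
192.168.1.1   osa1   osa.example.hr   # private name plus public alias of the head node
192.168.1.2   osa2
192.168.1.3   osa3
192.168.1.4   osa4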

For some of the above issues, a workaround is sometimes recommended: list the internal interface first in the /etc/hosts file. Occasionally outright hacks are promoted, such as having the same name resolve to two different IP addresses. I have not tried either, but note that both depend on behavior undocumented for /etc/hosts itself. Even if it were documented for the programs in question, it is against good practice to introduce one's own arbitrary requirements on the layout of standard system configuration files. More generally, "grabbing without asking" -- be it a processor, a network interface, or a node name -- is not the way nice software should behave.

The tftp server is only needed when the nodes boot, but when they all boot together (like after a power failure), the traffic can (reportedly) overwhelm a one-shot daemon started by inetd. With our minuscule cluster, tftpd-hpa worked so well with inetd I got bored and replaced it with atftpd, run as a standalone daemon with multi-threading. This is the "big gun" solution, but the daemon overhead is so low that it seems good enough for any purpose.

What is not good enough is that Intel persistently ships PXE implementations that do not support the blocksize option -- never mind what that is, just don't forget to disable it when configuring tftpd, if your bootable network port is managed by an Intel chipset. We also ran into a bug in atftpd -- it refuses to serve files beyond a certain size. Then it turned out the compressed root file system was only that large because packages were lying around in ./default/root/var/cache after installing software for the nodes. Cleaning them out reduced the compressed image from more than 40 MB to slightly less than 12 MB.
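For reference, the two ways of running the server look roughly as follows. The tftp directory and the exact flag for refusing the blocksize option are assumptions from the respective man pages of the day and should be double-checked against the versions you actually install:

# via inetd: a line of roughly this form in /etc/inetd.conf (tftpd-hpa)
tftp  dgram  udp  wait  root  /usr/sbin/in.tftpd  in.tftpd -s /tftpboot

# standalone, multi-threaded atftpd; --no-blksize disables blocksize negotiation
/usr/sbin/atftpd --daemon --no-blksize /tftpboot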

There is some delicacy in turning the "live" root file system for the nodes (managed via chroot) into a compressed image they will hold in RAM. The "live" system needs configuration files corresponding to the real head node where it is maintained, otherwise package management won't work. The solution is to have the files that are different on the running nodes -- client's hosts, client's fstab, and client's interfaces (Listings 4-6) -- separately at hand. So, one has the "live" tree, does a cp -a into a file formatted by mke2fs and mounted under loopback, and copies these few files into the node filesystem image at the last moment before compression. All this is most easily managed by a makefile (Listing 7). While you are at it, don't forget to change the option rootcheck=yes to rootcheck=no in the nodes' checkroot.sh initialization script. For some reason, the boot hangs if the root file system in RAM is checked.
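Stripped of the makefile dressing of Listing 7, the whole dance boils down to something like this sketch; the image size, mount point, and the location of the node-specific files are hypothetical:

# build a fresh image of the nodes' root file system
dd if=/dev/zero of=rootfs.img bs=1M count=40
mke2fs -F rootfs.img
mkdir -p /mnt/img
mount -o loop rootfs.img /mnt/img
cp -a ./default/root/. /mnt/img/

# overwrite the files that must differ on the running nodes (Listings 4-6)
cp node-files/hosts node-files/fstab /mnt/img/etc/
cp node-files/interfaces /mnt/img/etc/network/

umount /mnt/img
gzip -9 -c rootfs.img > /tftpboot/rootfs.gz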

Configuring MPI for the users requires a little twist. Remember they are not defined on the nodes because kernel load balancing (openMosix) does not need it. But MPI is userspace; the users must be able to ssh to the nodes. Defining them under chroot in the "live" tree on the head node is simple. Neatness, of course, requires they have the same UID as on the head node. Because /home is mounted from the head node via NFS, their home directories on the "live" tree are just dummies and will not be part of the compressed root file system anyway.

Now comes the twist. The users can enable themselves to use ssh on the slave nodes simply by going to their own .ssh directory on the head node and doing cat id_rsa.pub >> authorized_keys. This is possible because each node sees an identical file system. This means, incidentally, that the users need not know their own passwords on the nodes -- so much the better, since the password files are in RAM there and would need to be maintained off-line by the administrator. Similarly, the users' known_hosts file needs only two new lines -- one corresponding to the head node, and one corresponding to all the diskless nodes. The latter is obtained by noting the known_hosts line generated by the first login to any one of them and changing the name to a wildcard entry (i.e., in our case, replace "osa2" by "osa*").
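In commands, a user does something like the following once on the head node, assuming an RSA key pair already exists (ssh-keygen -t rsa). The host names are ours, and the wildcard trick works only because every diskless node boots the identical image and therefore presents the identical host key:

cd ~/.ssh
cat id_rsa.pub >> authorized_keys   # the nodes see the same /home via NFS
ssh osa2 hostname                   # first login records osa2's host key
# now edit known_hosts: change "osa2" at the start of the new line to "osa*"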

The only remaining thing to configure is the serial port console. It is like combing thistles out of a bushy dog -- easy in principle, but there are always some left. You need to set up a number of far-flung configurations consistently, and it takes a while for everything to settle in.

The basic tips are: turn flow control off everywhere, and be sure everything is set to the same speed (we use 115200 baud). Set the BIOS redirect, while you still have the video card in; then modify the bootloader configuration file (Listing 8). There are two lines to specify: the line telling the bootloader itself to go serial, and the kernel parameter line instructing the kernel to do so. Then go to ./default/root/etc/inittab and enable the console on the serial port, by uncommenting the line "#T0:23:respawn:/sbin/getty -L ttyS0 115200 vt100". Also comment out the gettys on the virtual terminals (since you haven't got virtual terminals, the gettys would be re-spawning like mad). Put the line "ttyS0" in ./default/root/etc/securetty below the line "console". Finally, set up minicom with the same communication parameters, but looking at ttyS1.

Now if you connect all the serial ports serially with null-modem cables (S1 on the head node to S0 on the first slave, S1 on the first slave to S0 on the second slave, etc.), you can ssh to node n-1 and bring up minicom to watch the console on node n. If everything works, you can give a reboot command on the console and watch the node come back up on the minicom, even allowing you to reconfigure the BIOS from home. One final tip -- don't iterate this, logging into the console via minicom and bringing up another minicom to get the console on the next node. When you have two minicoms in the same xterm, you can't control which one receives the escape sequences and that can lead to weird lockups.

If all these steps seem like a lot, take heart. First, I did it single-handedly in two or three short afternoons, and it was fun. Second, you can do it at your leisure, with a laptop and PC as I described and have everything debugged by the time the real thing arrives. In fact, that is one piece of advice worth repeating -- do everything possible in advance to have the cluster working smoothly from day one. It will give you quite a sense of accomplishment if no one notices that anything needed to be configured at all.

Where to Go from Here

The big step forward in a cluster setup is a clustering file system. I have not been there, so I can only speak from impressions gained by reading. The main advice before you take the plunge seems to be -- be sure you need it, and be sure you know why. There are legends out there, especially in the database context, of users' software rewrites gaining orders of magnitude in performance, while throwing hardware (and administrators) at the problem did nothing. In a configuration such as ours, the least one can do before spending time and money is to take out the Gigabit Ethernet switch and replace it with a Fast Ethernet one, nominally ten times slower. If performance doesn't go down significantly, the hardware configuration is unlikely to be the problem.

The key decision for a clustering file system is whether the expected use is data-centered or computation-centered. In the scientific community, the latter is the norm. Applications may need enormous amounts of scratch space with a high bandwidth per node, with all the output becoming obsolete almost instantly as soon as some aggregates are calculated from it. In our setup, scratch space is offered to users through a mount point /home/data, with subdirectories distinguished by usernames.

This mount point is a natural "hook" for a clustering file system, and I guess PVFS2 would be right for us, because it was developed for just such a situation. If you need strong failover capability on top of that, the answer would probably be Lustre. The tradeoff here is not so much "performance" -- whatever that means -- as the extra time invested in Lustre versus the time lost to the weaker crash recovery of PVFS2. The long-term, unavoidable time and trouble, not the initial setup, should dominate the decision. The last word on clustering file systems is surely that human time is the most expensive part of any tradeoff.

Conclusion

The configuration described here was designed by and for people who would prefer to do something else. We are currently enjoying the benefits of some forethought in using a cluster that gives us no more trouble than any old Linux workstation, namely zero. Adding new nodes is as easy as putting in light bulbs, and adding new storage under LVM is as easy as it was on the original head node. I hope to have convinced the gentle reader that this is a good position from which to contemplate further excursions into the mysteries of clustering.

Denis Sunko is a professor of theoretical physics at the Faculty of Science, University of Zagreb, Croatia. He wrote his first program in BASIC when he was 13 and has preferred to do something else ever since. An example of the kind of work for which this cluster was built can be found at: http://arxiv.org/abs/cond-mat/0411187.