Do-It-Yourself
Clusters
Denis Sunko
Our research group recently bought a new, fourth node for our
tiny cluster. I added the following entry to /etc/dhcpd.conf
on the master node (see Listing 1):
host osa4 {
    hardware ethernet 00:c3:d4:e5:a1:b2;
    fixed-address osa4;
}
("Osa" means wasp in Croatian.) Then we connected the new
box with three wires: power supply, patch cable to the Gigabit Ethernet
switch, and null-modem cable from its first serial port to the second
serial port of the third box. We turned on the power supply switch
and were done, in principle. But I could not resist making a show;
I had connected by ssh to the third box and pointed minicom at ttyS1,
so we could watch the new node boot from the BIOS up. Of course, I
knew it would go smoothly, because everything was the same as with
the other boxes before it. Only a misprint in the hardware address
above could have spoiled the fun.
How did we make this happen? Lots of leisurely reading, stretched
over two months, and not much work, concentrated into several afternoons.
This article is a compression of that reading along with a little
after-the-fact wisdom. The configuration I will describe in this
article is so simple, basic, and useful that I hereby boldly proclaim
-- it should be everybody's first cluster configuration.
Preventive Troubleshooting
You can do all the essential troubleshooting before even starting
to shop for equipment. To try out the configuration described in
this article, you'll need a Linux PC with an Ethernet port,
a crossover cable, and a laptop. The laptop does not need to have
Linux installed, because it will become a stateless slave node of
the PC. It does, however, need the ability to boot over the network;
you can check for this in the BIOS setup or the documentation.
The magic element is PXE, which is short for "pre-boot execution
environment". It is Intel's public specification, and
the de facto standard today. If your laptop does not support it,
the Ethernet card on your PC might; however, you probably have an
"unbootable" one, because those are cheaper. In that case the roles
are reversed, and for the PC to be booted over the network the laptop
must of course run Linux. Given a choice, it is better to configure
the laptop as the master, because it can then be used as a portable
test-bed at the vendor's, before accepting delivery of the slave nodes.
PXE capability is required for the nodes in the cluster. Another
important feature is the ability to export the console over the
serial port. Both the bootloader and the kernel can do this, so it
is a major disadvantage if the BIOS cannot and has to be reached
through a video console instead. Find the manuals for the boards you are
considering on the manufacturers' sites, and read the sections
on the BIOS and Ethernet controller, if present.
Some preventive troubleshooting is necessary with the supplier,
too. First, what if the BIOS does not perform as advertised? Ours
has a bug that makes it unable to boot over the network unless a
video card is present. The cheapest video card I found cost 2% of
the price of a single node; if we were buying 100 nodes, that bug
alone would have entitled us to two free ones. Second, who will
configure the BIOS as you need it? Both the console redirect and
network boot features are typically off by default, so you need
a video card during the first configuration, even if it will not
stay there. For a large number of boxes, that is the dreariest part
of the entire project, so the willingness to do it may well influence
the choice of supplier.
Finally, where will you put your cluster? Even a small one needs
a place out of the way, safe and cool, with a spare network connection
nearby. Floor space that meets these requirements tends to be restricted.
Take measurements to find out the maximum footprint your new toy
can have. Make sure you can count on that space -- maybe someone
else was there with a tape rule already? Do not underestimate the
physical aspects of getting it there either. Power supplies are
heavy, and lots of boxes with wires connecting them can be quite
unruly, especially in a cramped space. Some kind of rack is needed,
possibly on wheels. It should ideally arrive with the boxes, not
three weeks later (like ours did).
Non-Invasive Setup
Suppose you want your trusty workstation to become the new cluster's
head node. A non-invasive setup is one in which you can unplug the
extra boxes and have your workstation back, with no harm done. The
issue is not so much practical, as it is a good way to focus on
the essentials. Gauging your needs will become much more realistic
once you have a simple setup up and running. If you have to reconfigure
something, it is better not to be weighed down by a huge previous
investment in unnecessary work.
Of course, you do not want to be locked into a configuration that
cannot grow with your needs. Extra nodes and disks should be easily
added one by one, with no disruption of the existing configuration.
This can be achieved with some forethought in configuring the head
node. The user area should be managed with the logical volume manager
(LVM), which is a standard set of kernel drivers and userspace tools.
It allows any number of physical disks to be presented to the system
as a single logical volume, which can grow and shrink as usage requires.
This is an advantage even for a single workstation. The file system
on top of the LVM should be one designed for high performance, such
as reiserfs.
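As an illustration, a minimal LVM setup on the head node might look
like the sketch below; the device names, sizes, and volume group name
are assumptions, not our actual configuration:

pvcreate /dev/sdb /dev/sdc             # register two spare disks with LVM
vgcreate vg_home /dev/sdb /dev/sdc     # group them into a volume group
lvcreate -n lv_home -L 200G vg_home    # carve out a logical volume
mkreiserfs /dev/vg_home/lv_home        # a high-performance file system on top
mount /dev/vg_home/lv_home /home

Adding a disk later is just as painless:

pvcreate /dev/sdd
vgextend vg_home /dev/sdd
lvextend -L +200G /dev/vg_home/lv_home
resize_reiserfs /dev/vg_home/lv_home   # grow the file system to match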
I shall mention "clustering" file systems later, but
it is against the non-invasive philosophy to implement them before
having used that first cluster for some time. This is especially
so because the various solutions are designed with different usage
patterns in mind. For now, think simply in terms of extending the
already working head node with a few extra mainboards.
Nuts and Bolts
I will walk through a boot sequence in our cluster to show what
happened when that fourth box was turned on. To begin, the PXE client
built into the onboard Ethernet controller sent a request on the
network, identifying itself by its hardware address. This was recognized
by the DHCP server on the head node, because it had the hardware
address in its configuration file. (If it hadn't, it would
have rejected the request and noted the address from which it came
in syslog, enabling the administrator to update the configuration
file.) The server was configured to respond by sending a particular
file called "pxelinux.0" to the client, via the so-called
trivial file transfer protocol (tftp), so the head node had to act
as a tftp server as well. The client acquired the server's
and its own IP address in the DHCP exchange and executed the file
received by tftp.
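In dhcpd.conf, this part of the exchange boils down to two statements:
which file to hand out, and where the tftp server lives. A sketch with
made-up private addresses (Listing 1 has the real file):

subnet 192.168.1.0 netmask 255.255.255.0 {
    option routers 192.168.1.1;    # the head node
    next-server 192.168.1.1;       # the tftp server, also the head node
    filename "pxelinux.0";         # the file handed to the PXE client
}

One host stanza per node, like the one shown at the beginning of this
article, pins down each client's address.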
Now the fun starts, because "pxelinux.0" is a bootloader.
It, in turn, is configured to request the compressed kernel image
from the server, and the (similarly compressed) root file system
after it. The master and slave can use tftp again, because they
know each other's IP address from the PXE step.
When the kernel boots, the bootloader passes it a number of options,
which amount to three things. First, export the boot sequence over
the serial port console, like the BIOS and bootloader itself did.
Second, don't mount a root partition at all; instead use the
file system received from the server that is already loaded into
RAM. Third, start a new round of requests from the DHCP server to
obtain a complete network configuration, including the host name.
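The bootloader configuration that accomplishes this (Listing 8) boils
down to a sketch like the following, in which the file names and the
ramdisk size are assumptions consistent with the rest of this article:

# pxelinux.cfg/default (sketch)
serial 0 115200
default linux
label linux
    kernel vmlinuz-client
    append initrd=rootfs.gz root=/dev/ram0 ramdisk_size=65536 ip=dhcp console=ttyS0,115200n8

The serial line covers the bootloader itself; root=/dev/ram0 with the
initrd, ip=dhcp, and console= correspond to the three points above.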
After a couple seconds, the minicom monitoring the boot sequence
will display a login prompt, identifying the console as ttyS0, which
is the first serial port, of course. The node is alive and well,
having mounted all the other partitions (/usr, /home, etc.) from
the server via NFS.
The setup for this is very easy. Install the dhcpd, tftpd, and
syslinux packages (the latter contains the bootloader). If you have
the kernel and root file system for the slaves, do the obvious by
the book, and you are done. Only the serial port console requires
some nitpicking, to be described below, but it is not necessary
for a successful boot, either. You can connect to the node via ssh
and troubleshoot the console at leisure.
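On a Debian head node, this amounts to something like the following;
package and directory names vary between releases, so treat them as
placeholders:

apt-get install dhcp3-server tftpd-hpa syslinux
cp /usr/lib/syslinux/pxelinux.0 /var/lib/tftpboot/   # into the tftp root (path assumed)
mkdir /var/lib/tftpboot/pxelinux.cfg                 # where the bootloader looks for its config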
To prepare the kernel, compile it on the head node, using a separate
tree in the compile directory, say /usr/src/linux-client. This is
fun; you get to turn everything off in the configuration! Be sure
to disable modular support; you can put it in later if you really
need the hassle. Put in ext2 support for the root file system, and
NFSv4 (client and server) for the rest. Turn off the virtual terminals,
keyboard, mouse, IDE, SCSI, and whatnot, basically keeping only
networking and the driver for your Ethernet controller (CONFIG_PACKET
and CONFIG_FILTER are also needed for the network boot protocol).
Keep PCI even if you have no PCI cards, since mainboard controllers
are typically on the PCI bus. Don't forget to enable console
on the serial port; of course, the initrd capability is also needed
to put the root file system in RAM. When you are done, don't
run make install by reflex (yes, this happens). Copy the
kernel to the directory in which the tftp server is configured to
look for it, and name it as specified in the pxelinux configuration
file. You just compiled a kernel for an embedded Linux system.
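The build itself is routine; only the destination differs from the
usual. A sketch, with the image name and tftp directory assumed as
above:

cd /usr/src/linux-client
make menuconfig        # turn off everything the nodes do not need
make bzImage           # no modules, so nothing else to build
cp arch/i386/boot/bzImage /var/lib/tftpboot/vmlinuz-client   # NOT make install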
The only piece of real work is to prepare the root file system
for the clients. After such a free ride, the temptation to
hack it is almost too great -- namely, do cp -a of /lib,
/bin, and friends, from a live Linux system into a file formatted
by mke2fs and mounted under the loopback device, unmount the file,
compress it, and be done! I admit I did that the first time, and
it even worked, but it is Not Good Enough. The reasons are simple.
The nodes need a consistent system, and you need an easy way to
administer it.
Linux distributors have developed tools for the diskless community,
and it is worthwhile to learn to use them. If your head node is
Debian, you need to install the diskless package on the head node,
and download (NOT install) the diskless-image-simple package. Utilities
from the diskless package will build a live Debian root tree for
the release of your choice. The details are at:
http://wiki.debian.net/index.cgi?DiskLess
and the whole procedure is done in 15 minutes. If the directory from
which you want to administer the nodes is called "./default",
you will end up with the root of the tree called "./default/root".
It corresponds to / on a "live" system. The beauty of this
is that you can go to ./default/root, say chroot `pwd`, and
pretend you are on a freshly installed base system, the one your nodes
will get! In fact, this is where you would stand after the first reboot
on a new Debian install. You can install packages (sshd and nfs-client,
in particular) and do the housekeeping with standard tools. The price
paid for simplicity is that the root file system in RAM will end up
bigger than strictly necessary (about 28 MB uncompressed in our case),
but that is hardly an issue these days. Compressing the image of this
file system is the last step in your embedded Linux project for the
nodes.
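Installing those packages, for example, is done entirely inside the
chroot with the usual tools. A sketch (the exact Debian package names
for the ssh daemon and the NFS client may differ by release):

cd ./default/root
chroot `pwd`
apt-get update
apt-get install ssh nfs-common
exit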
How will the users see the new nodes? Well, for the moment they
can get the root prompt on them via ssh, because you forgot to set
the root password for the clients. Go back to ./default/root, chroot,
passwd, compress the image, reboot the nodes. Good; now the users
can't use them at all, because they are not defined on the
nodes. One final piece of the puzzle is still lacking, and that
is called openMosix:
http://openmosix.sourceforge.net
OpenMosix is a combination of kernel patch and daemon that migrates
processes to distribute workload evenly among nodes. For those who
remember, it is the closest thing to a VAXcluster on Linux. The process
migration is completely transparent to the users, and even to system
tools. Running top on the master box of our four-node, eight-processor
cluster can show, for instance, eight processes, each taking up 99%
of CPU time.
In our case, openMosix is the only clustering piece of software
we need. It is a stable, serious product, intended for more general
configurations than the simple Beowulf I am describing (see http://www.beowulf.org
for the definition). Its unique strength is that the users'
cluster experience is completely painless. They don't have
to write new code, link against particular libraries, or even be
defined on the nodes in order to use the cluster. Of course, some
may well choose to write cluster-aware programs. Since openMosix
is a kernel patch, it does not get in the way of userspace cluster
tools, and I can vouch for MPI in particular.
Installing openMosix is the last step in setting up the single-system
image cluster we have. All the necessary software is part of standard
Linux distributions and should run out of the box, provided the
configuration files are set up correctly. Administration is minimally
increased. To keep everything up to date, I only need to run apt-get
upgrade twice: once on the head node, and once after chroot
on ./default/root for the nodes. Security is no worse than it was
on the original head node; if anyone should become root on the slaves,
all they can do is destroy the / filesystem in RAM, easily corrected
by rebooting. The "real" file systems mounted via NFS
are as safe as NFS is.
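Spelled out, that upgrade routine is simply the following, with
./default/root being the administration directory described earlier:

apt-get update && apt-get upgrade          # the head node itself
chroot ./default/root apt-get update
chroot ./default/root apt-get upgrade      # the nodes' tree
# then rebuild the compressed image and reboot the nodes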
Servicing the nodes, including BIOS configuration, can be done
by ssh from anywhere, even in the middle of a calculation. The openMosix
daemon will migrate all processes out of a node that is shutting
down. Stability and availability are as good as those of the head
node alone, meaning, as usual with Linux, months of uptime between
occasional toying with the kernel. If you are running a Web farm
or sending a spaceship to Mars, you probably need failover solutions,
not to mention a larger cluster, but that is not the subject of
an entry-level article.
Tips and Tricks
As anyone who has tried to get a new configuration working can
testify, the path to glory is often strewn with quirky little difficulties,
each more frustrating than the last in its sheer inanity. Because under
Linux the software itself usually works, the problems are likely
to be in the configuration files, which are luckily flat ASCII (no
brain-dead "wizards"). The purpose of this section is
to give you a guided tour around (I hope not through) the pitfalls.
Some points are illustrated by taking a look at various configuration
files (see Listings 1-8).
Interestingly, none of the software above has a really elegant
solution for a very common setup, where the head node has two network
interface cards, one for external access and one for the private
cluster network; see the head node's interfaces file (Listing
2).
openMosix comes closest, with an option (setpe -p<node>)
to its "setpe" utility, which explicitly points to the
openMosix address used on the dual-NIC node for the cluster interface.
There is a related variable MYOMID in the static configuration file,
but its use is spoiled by an easily patched bug in the daemon startup
script; look under "init script bug" on their bug tracker
at SourceForge, if you get that far (this has recently been corrected,
but probably not yet in your distribution).
The DHCP server requires all subnets in the head node's hosts
file (Listing 3) to appear in the configuration file dhcpd.conf
(Listing 1). For example, if your public IP address is 123.45.6.78,
you need an explicit no-op line "subnet 123.45.6.0 netmask
255.255.255.0 {}" in the config file. It tells the server it
won't be serving requests from the whole wide world. Maybe
there is a reason it couldn't figure out as much if it didn't
find that line, but I don't know it.
MPI, or at least MPICH on Debian, has no obvious (to me) way to
be told not to use the head node for calculations. It automatically
invokes localhost in addition to whichever nodes are listed in the
configuration file "machines.LINUX". So if you have dual-processor
nodes and want to balance the load equally on all, including the
head, you need to list the head once and every other node twice
in the configuration file.
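With our four dual-processor boxes, machines.LINUX would therefore
look like the sketch below; the host names are assumptions following
our naming scheme, with osa1 standing for the head node's private
name. MPICH adds localhost by itself, so the head appears only once:

osa1
osa2
osa2
osa3
osa3
osa4
osa4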
A related quirk is that MPI grabs the "public" name
of the head node, one that is in principle unknown to the nodes
on the private internal network (where the head node has a different
name), and then there are problems when the slave nodes want to
talk back. This can be solved by adding the head node's public
name as an alias to its private name in the clients' hosts
file (Listing 4). (Because they do not know about the public network,
name resolution remains unique on both master and slaves.)
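In the clients' hosts file, this is one extra alias on the head
node's line; a sketch with made-up addresses and a made-up public
name (Listing 4 is the real file):

127.0.0.1     localhost
192.168.1.1   osa1    head.example.org
192.168.1.2   osa2
192.168.1.3   osa3
192.168.1.4   osa4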
A workaround sometimes recommended for some of the above issues is
to list the internal interface first in the /etc/hosts file, and
even hacks are occasionally promoted (like having the same name
resolve to two different IP addresses). I have not tried either,
but note that both depend on behavior undocumented for /etc/hosts
itself. Even if it were documented for the programs in question,
it is against good practice to introduce one's own arbitrary
requirements on the layout of standard system configuration files.
More generally, "grabbing without asking" -- be it
processor, network interface, or node name -- is not the way
nice software should behave.
The tftp server is only needed when the nodes boot, but when they
all boot together (like after a power failure), the traffic can
(reportedly) overwhelm a one-shot daemon started by inetd. With
our minuscule cluster, tftpd-hpa worked so well with inetd I got
bored and replaced it with atftpd, run as a standalone daemon with
multi-threading. This is the "big gun" solution, but the
daemon overhead is so low that it seems good enough for any purpose.
What is not good enough is that Intel persistently ships PXEs
that do not support the blocksize option -- never mind what
that is, just don't forget to disable it when configuring tftpd,
if your bootable network port is managed by an Intel chipset. We
also ran into a bug in atftpd -- it refuses to serve files beyond
a certain size. Then it turned out the compressed root file system
was only that large because packages were left lying around in
./default/root/var/cache after installing software for the nodes.
Cleaning them out reduced
the compressed image from more than 40 MB to slightly less than
12 MB.
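The cure for the cache problem is a routine cleanup inside the node
tree before rebuilding the image; for the blocksize option, atftpd
has a switch to turn the negotiation off (check the documentation of
your tftpd for the exact spelling):

chroot ./default/root apt-get clean               # empties /var/cache/apt/archives
atftpd --daemon --no-blksize /var/lib/tftpboot    # tftp root directory assumed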
There is some delicacy in turning the "live" root file
system for the nodes (managed via chroot) into a compressed image
they will hold in RAM. The "live" system needs configuration
files corresponding to the real head node where it is maintained,
otherwise package management won't work. The solution is to
have the files that are different on the running nodes -- client's
hosts, client's fstab, and client's interfaces (Listings
4-6) -- separately at hand. So, one has the "live"
tree, does a cp -a into a file formatted by mke2fs and mounted
under loopback, and copies these few files into the node filesystem
image at the last moment before compression. All this is most easily
managed by a makefile (Listing 7). While you are at it, don't
forget to change the option rootcheck=yes to rootcheck=no in the
nodes' checkroot.sh initialization script. For some reason,
the boot hangs if the root file system in RAM is checked.
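Stripped of the makefile dressing, the sequence amounts to the sketch
below; the sizes, mount point, and file names are illustrative, and
Listing 7 is the real thing:

dd if=/dev/zero of=rootfs.img bs=1M count=64    # an empty 64-MB image
mke2fs -F rootfs.img
mkdir -p /mnt/image
mount -o loop rootfs.img /mnt/image
cp -a ./default/root/. /mnt/image/              # the "live" tree
cp client-hosts      /mnt/image/etc/hosts       # the node-specific files
cp client-fstab      /mnt/image/etc/fstab       # (Listings 4-6)
cp client-interfaces /mnt/image/etc/network/interfaces
umount /mnt/image
gzip -9 -c rootfs.img > /var/lib/tftpboot/rootfs.gz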
Configuring MPI for the users requires a little twist. Remember
they are not defined on the nodes because kernel load balancing
(openMosix) does not need it. But MPI is userspace; the users must
be able to ssh to the nodes. Defining them under chroot in the "live"
tree on the head node is simple. Neatness, of course, requires they
have the same UID as on the head node. Because /home is mounted
from the head node via NFS, their home directories on the "live"
tree are just dummies and will not be part of the compressed root
file system anyway.
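Defining a user in the "live" tree is a one-liner inside the chroot;
the user name and UID here are of course made up, the point being
that the UID matches the one in the head node's /etc/passwd:

chroot ./default/root useradd -u 1001 alice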
Now comes the twist. The users can enable themselves to use ssh
on the slave nodes simply by going to their own .ssh directory on
the head node and doing cat id_rsa.pub >> authorized_keys.
This is possible because each node sees an identical file system.
This means, incidentally, that the users need not know their own
passwords on the nodes -- so much the better, since the password
files are in RAM there and would need to be maintained off-line
by the administrator. Similarly, the users' known_hosts file
needs only two new lines -- one corresponding to the head node,
and one corresponding to all the diskless nodes. The latter is obtained
by noting the known_hosts line generated by the first login to any
one of them and changing the name to a wildcard entry (i.e., in
our case, replace "osa2" by "osa*").
The only remaining thing to configure is the serial port console.
It is like combing thistles out of a bushy dog -- easy in principle,
but there are always some left. You need to set up a number of far-flung
configurations consistently, and it takes a while for everything
to settle in.
The basic tips are: turn flow control off everywhere, and be sure
everything is set to the same speed (we use 115200 baud). Set the
BIOS redirect, while you still have the video card in; then modify
the bootloader configuration file (Listing 8). There are two lines
to specify: the line telling the bootloader itself to go serial,
and the kernel parameter line instructing the kernel to do so. Then
go to ./default/root/etc/inittab and enable the console on the serial
port, by de-commenting the line "#T0:23:respawn:/sbin/getty
-L ttyS0 115200 vt100". Also comment out the gettys on the
virtual terminals (since you haven't got virtual terminals,
the gettys would be re-spawning like mad). Put the line "ttyS0"
in ./default/root/etc/securetty below the line "console".
Finally, set up minicom with the same communication parameters,
but looking at ttyS1.
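Collected in one place, the node-side edits leave the relevant part
of ./default/root/etc/inittab looking like this (the serial getty
active, the virtual-terminal gettys commented out, and likewise for
tty2 through tty6):

T0:23:respawn:/sbin/getty -L ttyS0 115200 vt100
#1:2345:respawn:/sbin/getty 38400 tty1

while ./default/root/etc/securetty gains the line ttyS0 right below
console, as mentioned above.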
Now if you connect all the serial ports serially with null-modem
cables (S1 on the head node to S0 on the first slave, S1 on the
first slave to S0 on the second slave, etc.), you can ssh to node
n-1 and bring up minicom to watch the console on node n. If everything
works, you can give a reboot command on the console and watch the
node come back up on the minicom, even allowing you to reconfigure
the BIOS from home. One final tip -- don't iterate this,
logging into the console via minicom and bringing up another minicom
to get the console on the next node. When you have two minicoms
in the same xterm, you can't control which one receives the
escape sequences and that can lead to weird lockups.
If all these steps seem like a lot, take heart. First, I did it
single-handedly in two or three short afternoons, and it was fun.
Second, you can do it at your leisure, with a laptop and PC as I
described and have everything debugged by the time the real thing
arrives. In fact, that is one piece of advice worth repeating --
do everything possible in advance to have the cluster working smoothly
from day one. It will give you quite a sense of accomplishment if
no one notices that anything needed to be configured at all.
Where to Go from Here
The big step forward in a cluster setup is a clustering file system.
I have not been there, so I can only speak from impressions gained
by reading. The main advice before you take the plunge seems to
be -- be sure you need it, and be sure you know why. There are
legends out there, especially in the database context, of users'
software rewrites gaining orders of magnitude in performance, while
throwing hardware (and administrators) at the problem did nothing.
In a configuration such as ours, the least one can do before spending
time and money is to take out the Gigabit Ethernet switch and replace
it with a Fast Ethernet one, nominally ten times slower. If performance
doesn't go down significantly, the hardware configuration is unlikely
to be the problem.
The key decision for a clustering file system is whether the expected
use is data-centered or computation-centered. In the scientific
community, the latter is the norm. Applications may need enormous
amounts of scratch space with a high bandwidth per node, with all
the output becoming obsolete almost instantly as soon as some aggregates
are calculated from it. In our setup, scratch space is offered to
users through a mount point /home/data, with subdirectories distinguished
by usernames.
This mount point is a natural "hook" for a clustering
file system, and I guess PVFS2 would be right for us because it
was developed for just such a situation. If you need strong failover
capability on top of that, the answer would probably be Lustre.
The tradeoff here is not so much "performance" --
whatever that means. It is extra time invested in Lustre vs. time
lost because of the weaker crash recovery of PVFS2. The long-term
unavoidable time and trouble, not the initial setup, should dominate
the decision. Whatever the last word in clustering file systems turns
out to be, human time is certainly the most expensive part of any tradeoff.
Conclusion
The configuration described here was designed by and for people
who would prefer to do something else. We are currently enjoying
the benefits of some forethought in using a cluster that gives us
no more trouble than any old Linux workstation, namely zero. Adding
new nodes is as easy as putting in light bulbs, and adding new storage
under LVM is as easy as it was on the original head node. I hope
to have convinced the gentle reader that this is a good position
from which to contemplate further excursions into the mysteries
of clustering.
Denis Sunko is a professor of theoretical physics at the Faculty
of Science, University of Zagreb, Croatia. He wrote his first program
in BASIC when he was 13 and has preferred to do something else ever
since. An example of the kind of work for which this cluster was
built can be found at: http://arxiv.org/abs/cond-mat/0411187. |