TKCluster

Tom Kunz

Linux has shown a lot of growth in the area of data-centric, high-availability clustering. Most admins are already familiar with computational clusters, known loosely as Beowulf clusters, which are implemented in the form of MPI, PVM, LAM, MOSIX, and other process-sharing and process-distributing technologies. There are also "Web service clusters", such as those distributed in years past by TurboLinux and others. These were typically groups of similarly configured servers that used DNS and round-robin IP address tricks to give the illusion of Web server high-availability to end users.

Cohesive operation between the nodes, however, was still achieved only through a shared-storage medium such as Fibre Channel or shared SCSI, or through proprietary cluster hardware and software -- all prohibitively expensive for small businesses. A database engine that serves a Web cluster must still itself be clustered to achieve true high availability. Application-level high-availability tools that transparently replicate data between servers (such as the MySQL database engine's built-in replication) are also being used to provide some level of redundancy.

One area in which Linux is still starved for attention is lightweight, easily configured, affordable high availability -- a general-purpose cluster. A general-purpose, high-availability cluster must be "application agnostic" -- it should not care what runs on it, whether that be a Web server, mail server, database server, or any future, yet-unknown type of service. The cluster should give a uniform style of operation no matter what application is running.

In response to this, I have written TKCluster (when I initially wrote it, I couldn't think of a good name for it, so I just prefixed "cluster" with my initials). TKCluster is available under the GPL so that anyone can freely download and modify it to suit their needs.

Overview

TKCluster itself is a cluster manager. Raw data replication between nodes is performed by the wonderful DRBD driver by Philipp Reisner. DRBD is a block device that maps to a given raw disk partition and a socket. Writes to the DRBD device (/dev/nb0 .. /dev/nbX) go both to the physical disk in the local machine and to the waiting secondary node over a standard Ethernet connection. All clusters require some kind of "heartbeat" mechanism. After experimenting with various ones, I chose openMosix. openMosix was designed to share computation-intensive process loads between multiple machines; however, I have yet to find anything that does as good a job at maintaining a frequently updated list of connected machines.

The process-load sharing and MFS filesystem (analogous to traditional NFS, but infinitely smarter) make openMosix a perfect candidate for helping to tie the cluster nodes together. Although MFS is not necessary for TKCluster operation, it sure helps when copying configuration files around between the nodes of the cluster. TKCluster's role is to use the data gleaned from both openMosix and DRBD to make decisions about starting services, seizing control of the cluster IP address, and keeping the sister copy of TKCluster on the other node of the cluster aware of what's going on.

Design

TKCluster is intended for use in clusters where data is to be shared between two nodes, a primary and a secondary. The data that is "shared" between them lives on one or more partitions, each partition having its own DRBD device. These partitions should be separate from the system partitions (/, /boot, /usr, /var, etc.), because the secondary node will have no access to these partitions until the primary dies (and the secondary seizes control) or the primary gives up control of the partition. The secondary simply waits for something to happen and then acts accordingly to take control of the partition and restart the services previously served by the primary.

These DRBD devices are "raw" devices; they need not have a filesystem on them. Database engines such as Oracle or Informix can be configured to use "raw" partitions. Because DRBD is a block device driver, it simply passes raw block writes through to the local disk. The application running on top of it does not care about the underlying device, as long as requests are satisfied and the driver behaves as a block device driver should. A DRBD device can also contain a traditional filesystem such as ext3 and be mounted in the usual fashion. DRBD allows the data sharing to be thoroughly agnostic of the applications that talk to it.
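
As a quick illustration of that point -- a sketch only, assuming /dev/nb0 has already been configured as described in the Setup section and that this node is currently the DRBD primary (the /mnt/shared mount point is purely an example) -- putting an ext3 filesystem on the device looks no different than on a plain partition:

# mke2fs -j /dev/nb0
# mkdir -p /mnt/shared
# mount /dev/nb0 /mnt/shared
# df -h /mnt/shared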

When the cluster is fully configured and initially powered up, the primary will talk to the secondary and push its data over to the secondary as needed. DRBD can do a full synchronization, meaning a direct, block-for-block copy of the entire partition, or a "fast" synchronization, in which it copies over only the changes it finds. For the most part, a cold boot of the cluster will always result in a full sync, while momentary loss of connectivity between primary and secondary may result in "fast" syncs.

Once the sync is done, the secondary's partition will have all the data of the primary's. All future writes on the primary will flow over to the secondary and be committed there as well.

While the cluster is up, openMosix will be active on both machines and will maintain its own list of which machines are currently up. The process-sharing capabilities of openMosix may be used, although most servers do not cause excessive computational loads. openMosix judiciously migrates processes that are strictly computational, and generally keeps I/O-bound processes local to the machine on which they were started. The main purpose of openMosix here is to actively collect availability data so that the secondary node of the cluster can accurately and reliably detect when the primary node dies, and then grab the services of the primary accordingly.

Because DRBD is intended for duplication of data between only two nodes, scaling the data high availability of TKCluster beyond two nodes is not possible. However, thanks to openMosix, if TKCluster is to be used to service a computational cluster, additional "openMosix-only" nodes can be added to the cluster. These nodes would not be running DRBD, and would only participate in the computationally intensive processes in the cluster. The computational horsepower of the cluster can be multiplied as needed by adding more and more "openMosix-only" nodes to the cluster, although this is generally only needed in scientific and research-oriented endeavors.
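
As a hedged illustration, adding compute-only nodes is just a matter of extending /etc/openmosix.map on every machine (the file's format is covered in the Setup section); entries 3 and 4 here are hypothetical openMosix-only boxes that run no DRBD at all:

1       10.0.0.1        1
2       10.0.0.2        1
3       10.0.0.3        1
4       10.0.0.4        1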

Hardware

TKCluster is intended to enable small- to medium-sized businesses to gain the benefits of data high availability without paying large sums for commercial clustering hardware and software. As such, this article will highlight the installation and usage of TKCluster as it is currently installed at one of my customer's sites. The hardware used in this example is not the cheapest of the cheap, nor is it the most expensive. It is all fairly standard, commonly available PC hardware.

If you choose to duplicate this particular installation today, depending on your hardware vendor, the initial hardware outlay will be less than $2000 -- most likely substantially less. Even lower-powered machines are quite usable with TKCluster; however, my customer specifically chose this configuration because of previous experience with the same hardware. I am a small business owner, and I have designed TKCluster so that others in a similar situation should be able to use it as well.

Two machines, identically configured in hardware and BIOS, are preferred to make life easy on the administrator. It is possible to have very different hardware and chipsets between the two nodes of the cluster; however, I strongly recommend identical hardware configurations. This will vastly simplify the administrative overhead if anything is ever changed inside the servers. The hardware selected for each machine in this configuration consists of the following:

  • Asus P4P800 motherboard (includes 3Com 3c940 gigabit LAN and Intel 865G/ICH5 chipset)
  • 512MB Crucial PC-3200 RAM (2 x 256M DIMMs)
  • Western Digital 36GB SATA "Raptor" HD
  • Intel 2.6 GHz CPU, 800 MHz FSB
  • Intel Pro/1000 MT gigabit NIC
  • A generic case, power supply, video card, CD-ROM, and floppy

Hardware gurus familiar with Intel products will immediately notice that the motherboard and NIC are not traditionally considered "server-class", because they lack the faster PCI-X architecture. However, in this configuration, which specifically aims to be affordable for the small business owner, it will become apparent why a PCI-X motherboard and the PCI-X version of the Intel Pro/1000 gigabit card are not necessary.

The limiting factor of this particular configuration is not necessarily the PCI bus. Major performance benefits will not be realized by simply adding PCI-X to the configuration. If you bring up one of the two nodes of the cluster with Fedora Core 1.0, the libata drivers from Fedora will address the SATA hard drive as /dev/sda. Once installed and running, you can run hdparm -Tt /dev/sda and get something like the following:

# hdparm -Tt /dev/sda

/dev/sda:
 Timing buffer-cache reads:  3076 MB in  2.00 seconds = 1538.00 MB/sec
 Timing buffered disk reads:  160 MB in  3.03 seconds =   52.81 MB/sec
Repeated executions of hdparm will average about 1500-1700 MB/sec cached and 50-53 MB/sec uncached. The maximum sustained throughput of this particular disk is about 53 MB/sec. In the grand scheme of things, virtually all hard drives, whether ATA, SATA, or SCSI, have a maximum sustained throughput of between 40 and 60 MB/sec, so this particular disk is in the upper-middle range of performance. There are other articles that analyze the performance throughput of various configurations, but without moving up to a RAID configuration using much more expensive controller cards and multiple disks, the best single-disk throughput realizable with today's hard-drive technology is a sustained 60 MB/sec or so. For small business owners, where every penny counts, cost is a compelling reason to stay with a single disk on the built-in controller and not opt for a more expensive RAID controller connected to several disks.

Some users may see the glossy advertisements of "320 MB/sec SCSI" and start thinking their systems simply must have it. However, remember that the advertised 320 MB/sec is not the sustained throughput; it is only the speed at which data moves from the host controller SCSI card to the disk's electronics and into the cache on the disk itself. SATA disk caches are typically 8M and only the most expensive SCSI disks have 16M of cache onboard. The actual cache-to-platter speed is the important speed; it is that speed that limits sustained throughput to the disk.

While this particular disk is physically limited to about 53 MB/sec, the theoretical limit of a gigabit NIC, connected to the fastest possible PCI slot, is 125 MB/sec. The NIC in this configuration is not the limiting factor for duplicating data between the nodes of the cluster; the disk is. Even the practical throughput of a NIC on a plain PCI bus (not PCI-X) is still a bit higher than that of the hard drive itself.

The choice of a cheaper motherboard and no RAID controller sounds logical in a theoretical sense -- it seems to work out on paper that the limiting factor will be the hard drive itself. Increasing disk performance could be a major expense, which could prohibit small businesses from wanting to buy into an expensive pair of machines. But some are still likely to be skeptical, believing that maybe getting a "server-class" motherboard will boost performance substantially. If you're skeptical, that's fine, but I'll cut to the chase -- cluster performance for data replication between the two cluster nodes is within a few percent of the disk's maximum throughput.

With a maximum sustained disk throughput of 50-53 MB/s, this cluster configuration showed a consistent 47 to 50 MB/sec replication speed. Factoring in IP latency, block-copy overhead, and data transfer across the PCI bus from disk to NIC, a few MB/sec performance hit isn't so bad after all. In this example, the customer's 18GB DRBD partition replicated itself between the two servers in about 7 minutes.

The only way to really raise that limit would be a more expensive RAID card (not just a motherboard with one of the onboard RAID chipsets) connected to multiple disks. Since a good 3Ware card and several disks would cost about as much as the rest of the machine, and a new motherboard would cost substantially more than the Asus P4P800, that route is likely to be cost-prohibitive for small business owners. For enterprise solutions serving dozens or hundreds of users, however, the cost may be well justified.

I have had the opportunity to work with several flavors of gigabit NIC, but I keep coming back to the Intel Pro/1000 MT for both reliability and economy. It's not that I'm an "Intel bigot" -- I use DLink, NetGear, Realtek, and 3Com cards in a lot of machines, but not as the NIC on which all the cluster data rides. If you have found a particular flavor of gigabit card that you trust, one that has seen a lot of long-term sustained throughput without any hiccups, this cluster configuration will still work well for you.

Software

This cluster relies on several pieces of GPL software to function properly. The full list can be found in the Resources section, but the software is summarized here:

  • TKCluster -- http://www.solidrocktechnologies.com/clustermanager
  • openMosix -- http://www.openmosix.org
  • DRBD -- http://www.linbit.com
  • Heartbeat -- http://www.linux-ha.org
  • Various kernel patches

The current version of TKCluster was developed against the above packages, using Fedora Core 1.0 as the base distribution. TKCluster consists of Perl and shell scripts, so anyone familiar with those should be able to edit and alter it as necessary.

Kernel Issues

ICH5/ICH5R Serial ATA

The Intel ICH5/ICH5R offers a substantial improvement over previous ICH chipsets. Although Linux support for the ICH5 SATA features is present in later 2.4 kernels, the canonical kernel driver has some alleged instabilities (I say "alleged" only because I have not fully explored it myself) as well as performance weaknesses. Jeff Garzik has written a nice SATA driver for the ICH5 SATA as well as other VIA and SII chipsets. This comes as a set of kernel patches found at kernel.org:

http://www.kernel.org/pub/linux/kernel/people/jgarzik/libata/
Unfortunately, the patches do not appear to have been generated against the canonical kernel; some of the diffs in the file 2.4.22-libata1.patch did not apply cleanly against 2.4.22. However, this is not a showstopper by any means -- you can patch the files whose hunks fail by hand.
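
One generic way to handle this (not specific to this patch set) is to dry-run the patch, apply it, and then clean up the rejected hunks by hand; the directory and patch locations below are placeholders:

# cd /usr/src/linux-2.4.22
# patch -p1 --dry-run < ../2.4.22-libata1.patch
# patch -p1 < ../2.4.22-libata1.patch
# find . -name '*.rej'
Each "*.rej" file that find turns up lists the hunks that failed for the source file next to it; edit that file and fold the changes in manually.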

sk98lin Drivers

A fair amount of "fiddling" had to be done to get the kernel patched. I was impressed by the number of "extra" drivers for gigabit adapters in the Fedora-distributed kernel, so I attempted to develop some patches against the Fedora-distributed 2.4.22 kernel source. However, because that source seems to use the new scheduler code, the patches for openMosix did not apply cleanly to it. In fact, the patches failed miserably against the scheduler code and resulted in unusable source. Between that and the difficulties encountered with the libata patches, I was very disappointed not to be able to develop a concise patch for Fedora's 2.4.22 source.

openMosix

I ended up pulling canonical kernel source for 2.4.22 straight from kernel.org, and then patching with openMosix. Once the openMosix patches were applied, I copied the gigabit drivers (notably the sk98lin driver for the 3Com 3c940 adapter built into the P4P800 motherboard) directly from the Fedora source, as well as the files for the libata source. I went through the libata patch file and copied all relevant code into the proper directories. I believe you will also find the latest e1000 driver patches in there, which work with a larger variety of Intel Pro/1000 variants and board revisions than the canonical drivers. If you find that your Pro/1000 is not automatically detected by the driver, try downloading the latest version of the driver directly from Intel and compiling it. New board revisions occur all the time, and the canonical kernel drivers become outdated quickly.

If you are a bit squeamish about this level of patching, I have provided a reference page containing links to the kernel source that I patched. The kernel tarball available on that page contains only the canonical kernel source plus the aforementioned drivers that I copied in from the Fedora source. The ata_piix driver from Jeff Garzik is quite nice and offers tremendous performance. If you are using mod_scsi, and especially if you are using the driver to talk to a CD/DVD burner, pay special attention to that driver. If you accidentally use the "old" driver instead of the new one, your machine is likely to hang on boot when the module is inserted.

Setup and Applications

The easiest way to get the cluster up and running is the following:

1. Assemble all hardware, but put both SATA hard drives into one of the machines. The second hard drive will then be available so that the first can be duplicated onto it once the first is fully configured.

2. Install Fedora Core 1.0 (Fedora Core 2 is not available in a non-beta release at the time of this writing, although it probably will be by the time this goes to press).

Make sure you get Perl installed because TKCluster is written mostly in Perl. During installation, set aside at least one large partition that will be used as the "cluster" partition. I set up this particular customer with the following disk layout:

Disk /dev/sda: 37.0 GB, 37019566080 bytes
64 heads, 32 sectors/track, 35304 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/sda1   *         1        96     98288   83  Linux
/dev/sda2            97     35304  36052992    5  Extended
/dev/sda5            97      1074   1001456   82  Linux swap
/dev/sda6          1075      2052   1001456   83  Linux
/dev/sda7          2053      5959   4000752   83  Linux
/dev/sda8          5960      7913   2000880   83  Linux
/dev/sda9          7914     17680  10001392   83  Linux
/dev/sda10        17681     35304  18046960   83  Linux
To give you an idea of the layout, here is a snippet from "mount":

/dev/sda6 on / type ext3 (rw)
/dev/sda1 on /boot type ext3 (rw)
/dev/sda9 on /home type ext3 (rw)
/dev/sda7 on /usr type ext3 (rw)
/dev/sda8 on /var type ext3 (rw)
Note that /dev/sda10 is specifically left out. This will be the cluster partition. If you are using a different disk layout, set aside at least one partition where you will be placing the shared data area.

3. Run "up2date" and make sure everything has all the necessary security patches.

4. Download and install the kernel source I have provided on the TKCluster home page, or patch your own kernel as desired if you are using different hardware.

In case something goes wrong, do not replace your existing kernel from Fedora; just configure /etc/grub.conf so that you can select the new kernel from the menu and boot it. Once the kernel boots fine and you are comfortable with it, modify /etc/grub.conf so that it boots this new kernel without your selecting it. Make sure you also grab the openMosix user-land tools from their Web site and install them at the same time.
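
As a sketch (the kernel and initrd file names are placeholders for whatever your build and the stock Fedora install actually created; root=/dev/sda6 matches the partition layout shown in step 2), the relevant part of /etc/grub.conf might look like the following, with "default" still pointing at the stock entry until you trust the new kernel:

default=1
timeout=10

title openMosix 2.4.22 (custom)
        root (hd0,0)
        kernel /vmlinuz-2.4.22-openmosix ro root=/dev/sda6
        initrd /initrd-2.4.22-openmosix.img
title Fedora Core (2.4.22-1.2115.nptl)
        root (hd0,0)
        kernel /vmlinuz-2.4.22-1.2115.nptl ro root=/dev/sda6
        initrd /initrd-2.4.22-1.2115.nptl.img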

5. You will have to manually adjust /etc/modules.conf if the Fedora install does not automatically detect both the 3c940 and the Intel Pro/1000. If so, you will need to insert the following two lines into /etc/modules.conf, and comment-out any other references to "eth0" and "eth1" elsewhere in the file:

alias eth0 sk98lin
alias eth1 e1000
Note that the Intel Pro/1000 MT uses the "e1000" driver, and that this is specifically set to be "eth1".

Also, make sure that /etc/hosts is set up properly. The cluster relies on /etc/hosts resolving names to the correct addresses so that messages go out over the correct interface:

127.0.0.1               localhost.localdomain localhost
192.168.1.201           cluster1 cluster1.tntreloading.com
192.168.1.202           cluster2 cluster2.tntreloading.com
192.168.1.230           cluster cluster.tntreloading.com
10.0.0.1                cluster1p cluster1p.tntreloading.com
10.0.0.2                cluster2p cluster2p.tntreloading.com
Note the use of "192.168.1.230". This is the IP alias that "floats" between the two cluster nodes. Whichever cluster node is currently the primary owns that IP address as an alias. When the primary goes down, the secondary grabs that address by using the "IPaddr" script from the heartbeat package. Install the "heartbeat" package at this point if you have not already done so.
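
If you want to sanity-check the heartbeat package by hand before TKCluster is ever started, the IPaddr script can be driven directly (TKCluster invokes it the same way, as the log excerpts later show). On whichever node should currently hold the floating address, something like the following brings the alias up, checks it, and tears it down again; "status" should report whether the alias is currently held, assuming a stock heartbeat install:

# /etc/ha.d/resource.d/IPaddr 192.168.1.230 start
# /etc/ha.d/resource.d/IPaddr 192.168.1.230 status
# /etc/ha.d/resource.d/IPaddr 192.168.1.230 stop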

6. If all has gone well and you have a bootable system that sees its disk properly as /dev/sda, talks to both NICs, and has an entry for /proc/hpc (the openMosix area under /proc), you are now ready to configure the specific features of openMosix. The Intel Pro/1000 card will be used as the cluster NIC, over which both openMosix and DRBD will talk to each other. The 3c940 will be used for talking to the rest of the LAN.

The rest of this document will assume that the Intel card on this machine is using the address 10.0.0.1 netmask 255.255.255.0, and the 3c940 is using 192.168.1.201 netmask 255.255.255.0. When the secondary machine is brought up, it will be configured similarly, with the Intel card as eth1 on 10.0.0.2/255.255.255.0 and the 3c940 as eth0 on 192.168.1.202/255.255.255.0. The Intel cards in each machine will be connected directly to one another via an MDI/X "crossover" UTP cable.
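
On Fedora, that addressing usually amounts to two small files under /etc/sysconfig/network-scripts/ on each machine. A sketch for the primary node follows (the secondary is identical except for 192.168.1.202 and 10.0.0.2; add GATEWAY and similar settings as your LAN requires):

# /etc/sysconfig/network-scripts/ifcfg-eth0  (3c940, LAN side)
DEVICE=eth0
BOOTPROTO=static
IPADDR=192.168.1.201
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth1  (Intel Pro/1000, cluster link)
DEVICE=eth1
BOOTPROTO=static
IPADDR=10.0.0.1
NETMASK=255.255.255.0
ONBOOT=yes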

7. Make sure /etc/openmosix/openmosix.config has the following parameters turned on:

AUTODISCIF=eth1
MYOMID=1
MFS=yes
8. Create /etc/openmosix.map with the following entries, which tell openMosix to set up 10.0.0.1 as node #1, and 10.0.0.2 as node #2:

1       10.0.0.1        1
2       10.0.0.2        1
9. Download and install DRBD from Linbit. It should be a very quick and easy install. The default configuration is for two DRBD devices; however, you can create up to 255 DRBD devices by altering drbd/drbd_main.c and changing this line:

int minor_count=2;
to whatever number of DRBD devices you want.

10. Alter your iptables configuration to allow DRBD's port 7788/tcp through, and openMosix ports of 5000-5700/udp, 723/tcp, 4660/tcp, and 5428/udp. If you encounter trouble with the firewall, please consult the openMosix FAQ to make sure you have the most recent information on port numbers it uses.
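
If you manage iptables by hand rather than through Fedora's firewall tool, the additions might look something like the following, restricted to the cluster interface (treat the openMosix port list as a snapshot -- as noted above, the openMosix FAQ has the current numbers):

# DRBD replication
iptables -A INPUT -i eth1 -p tcp --dport 7788 -j ACCEPT
# openMosix
iptables -A INPUT -i eth1 -p udp --dport 5000:5700 -j ACCEPT
iptables -A INPUT -i eth1 -p tcp --dport 723 -j ACCEPT
iptables -A INPUT -i eth1 -p tcp --dport 4660 -j ACCEPT
iptables -A INPUT -i eth1 -p udp --dport 5428 -j ACCEPT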

11. If a reboot still results in a usable machine, you should now duplicate the disk onto the second disk, which will go into the other machine, with dd:

dd if=/dev/sda of=/dev/sdb bs=128M
12. When dd is finished, shut down and move /dev/sdb into the other machine. Boot it up, and change its IP addresses to the previously mentioned addresses. Also make sure to alter /etc/openmosix/openmosix.config and set "MYOMID=2" so that openMosix knows it is running on the secondary node.

13. Now, when both machines boot, you will have an openMosix cluster that is capable of sharing process load between them. If you run "mosmon", you should see both "1" and "2" along the bottom edge of the display, meaning that both nodes are up and visible to openMosix.

14. If you have an identical disk layout to the one shown here, format the /dev/sda10 partition on the primary machine with the ext3 filesystem. If you are not using an identical layout, locate the free space you set aside on your disk during install and format it.

Once formatted, I suggest mounting it to /opt. (Make sure /opt is NOT listed in /etc/fstab; you do not want to let the OS have any direct control over mounting, fsck'ing, etc., of that partition. TKCluster must be the only thing controlling access to it.) Then place some data in it, such as a PostgreSQL data directory, a MySQL data directory, or some Web home directories, or perhaps share /opt via NFS or Samba. Configure the related servers or their init scripts to point to that directory structure as necessary. You will need to write your own init script for /dev/nb0, modeled after the sample "runme" in the TKCluster package, so that it starts and stops all relevant servers using their own default init scripts.

Because TKCluster will have control over these services and not the OS itself, remove the symlinks in /etc/rc3.d and /etc/rc5.d to those services so that they do not start or stop prematurely. TKCluster will be in charge of making the partition available to the services and then starting the services when ready. Make sure all modifications made outside of /opt are done on both machines.

15. Set up TKCluster on both machines. The default configuration file in the package, "cluster.conf", contains a setup matching this example. If you use a configuration identical to or very similar to this one, the only parameter you will need to edit is the "PRIMARYSERVICES" line that references /dev/nb0. Point this to a script that accepts "start" and "stop" parameters on the command line and accordingly starts and stops the services that will be accessing the data in the partition "/dev/nb0". Each partition you share between the cluster nodes will have its own start/stop script.
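
A rough sketch of such a script follows; the service names are hypothetical, and the "runme" sample shipped with TKCluster is the real template. TKCluster itself handles fsck'ing and mounting the partition, so the script only needs to start and stop whatever services live on it:

#!/bin/sh
# Hypothetical PRIMARYSERVICES script for /dev/nb0.
# TKCluster calls this with "start" once the partition is available,
# and with "stop" when it is giving up the partition.

case "$1" in
  start)
    /etc/init.d/postgresql start
    /etc/init.d/httpd start
    ;;
  stop)
    /etc/init.d/httpd stop
    /etc/init.d/postgresql stop
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    exit 1
    ;;
esac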

Contained in the package is the script "installme". Once you have edited cluster.conf appropriately and created your start/stop script, just run "./installme" and it will copy the files to the right directories. If you have a version of Perl other than 5.8.3, or if you have installed Perl somewhere other than /usr, you will need to make some slight changes to the script.

16. Test DRBD to make sure the firewall rules and DRBD itself are working and that data replication can take place. On the primary node:

# modprobe drbd
# drbdsetup /dev/nb0 disk /dev/sda10
# drbdsetup /dev/nb0 net 10.0.0.1:7788 10.0.0.2:7788 C -s 16384 --sync-nice -19
# drbdsetup /dev/nb0 primary
and then on the secondary node:

# modprobe drbd
# drbdsetup /dev/nb0 disk /dev/sda10
# drbdsetup /dev/nb0 net 10.0.0.2:7788 10.0.0.1:7788 C -s 16384 --sync-nice -19
You will now see occasional blinks of the hard drive activity lights in unison on both machines. The default sync rate is only 250KB/sec. To change that to a higher throughput, do this on the primary node:

# drbdsetup /dev/nb0 syncer --max 600000
# drbdsetup /dev/nb0 syncer --min 599999
The hard drive lights on each system should now stay "on" almost continuously. The raw data on the primary system's /dev/sda10 is flowing over the gigabit link to the secondary. To see what kind of rate you are getting, take a look at DRBD's /proc entry:

# cat /proc/drbd

version: 0.6.12 (api:64/proto:62)

0: cs:SyncingAll st:Primary/Secondary ns:474328 nr:18054652 dw:18061700 dr:482233 pe:840 ua:0
        [>...................] sync'ed:  2.7% (17160/17623)M
        finish: 0:07:10h speed: 47,441 (46,201) K/sec
1: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
The line "finish: 0:07:10h..." tells all we need to know about the speed of synchronization. The first number "47,441" is the instantaneous rate of the current transfer in KB/sec. The number in parenthesis "(46,201)" is the average over the time the sync has been running. In this case, the sync'er setting had already been set to a high value before initiating the sync at 250 KB/sec, so the average is very close to the instantaneous. This is generally the case in a failover situation when the secondary takes over the role of primary.

17. We have now verified that both openMosix and drbd are working as desired. At this point, drbd can be stopped and the drbd.o module unloaded on both machines, as TKCluster does this job as part of its startup. After that, TKCluster can be started on the primary machine by running its init script:

# /etc/init.d/storagecluster start
You can watch the progress of TKCluster in /var/log/cluster.log. If everything is set up right and your services are starting properly, you will see something like the following:

.../usr/local/bin/checkinitdead.pl:3992: Starting /usr/local/bin/checkinitdead.pl
.../usr/local/bin/checkinitdead.pl:3992: I am cluster1p, cluster1p is preferred storage master
.../usr/local/bin/checkinitdead.pl:3992: returning immediately to become master
.../usr/local/bin/becomestorage.pl:3997: role is initially set to
.../usr/local/bin/becomestorage.pl:3997: "2" is up
.../usr/local/bin/storaged.pl:3990: /usr/local/bin/storaged.pl starting
.../usr/local/bin/storaged.pl:3990: writing my wait state to /tmp/storaged.waitstatus as "waiting"
.../usr/local/bin/storaged.pl:3990: server started on port 2345
.../usr/local/bin/becomestorage.pl:3997: 10.0.0.2 waitstatus is nowait (returned )
.../usr/local/bin/becomestorage.pl:3997: "10.0.0.2" is not waiting for me, so I'll become primary
.../usr/local/bin/storaged.pl:4011: writing my wait state to /tmp/storaged.waitstatus as "waiting"
.../usr/local/bin/becomestorage.pl:3997: taking cluster IP: /etc/ha.d/resource.d/IPaddr \
  192.168.1.230 start >/dev/null 2>&1
.../usr/local/bin/storaged.pl:4073: returning getwait() with "waiting"
.../usr/local/bin/storaged.pl:4081: writing my wait state to /tmp/storaged.waitstatus as "waiting"
.../usr/local/bin/becomestorage.pl:3997: EXEC: /sbin/e2fsck -p /dev/sda10
.../usr/local/bin/becomestorage.pl:3997: RET: /sbin/e2fsck -p /dev/sda10 0
Each TKCluster process reports date and time (replaced by "..." here), the full pathname, and its PID on each line of the logfile. In many cases where TKCluster is calling an external program, the word "EXEC:" prefixes the command. When the command returns, the message includes "RET:", followed by the command, with the exit status appended to the end (note the "0" on the last line of the above snippet). After this, you would see various "EXEC" and "RET" pairs for loading drbd, configuring the drbd device, and so on.

Now that the primary machine is up, you can run the same init script on the secondary and watch /var/log/cluster.log on the secondary machine as the corresponding processes start there.

Testing

TKCluster is relatively easy to test. While both machines are up and after the initial drbd sync is done, hit the reset button on the secondary while watching /var/log/cluster.log on the primary. You should see no messages appear in /var/log/cluster.log until the secondary machine starts coming back up. When that happens, you will see:

.../usr/local/bin/storaged.pl:15130: returning getwait() with "waiting"
This says that the secondary attempted to contact the primary to inform it that it was coming up. The next few messages will be something like:

.../usr/local/bin/monitorstorage.pl:1471: Status/Role Change from primary to primary syncing
.../usr/local/bin/monitorstorage.pl:1471: Status/Role Change from primary syncing to primary
TKCluster recognizes the role changes and ensures that DRBD is in a state where it will accept connections from the remote side for resynchronization. It is during this that a "fast" sync will occur.

Once the drbd sync is done, hit the reset switch on the primary while watching /var/log/cluster.log on the secondary. You should see something like this:

.../usr/local/bin/monitorstorage.pl:4029: MOSIX not alive on node 1 (cluster1p)
This message will be repeated by one or more "monitorstorage.pl" processes until the number of retries configured in cluster.conf has been exhausted, at which point "becomestorage.pl" is invoked with the "grab" parameter, meaning it will forcibly attempt to reboot the primary node and take over all of its services:

.../usr/local/bin/monitorstorage.pl:4029: MOSIX not alive on node 1 (cluster1p)
.../usr/local/bin/monitorstorage.pl:4014: getting out of retry loop
.../usr/local/bin/monitorstorage.pl:4014: Falling out of monitorstorage.pl.  I AM NOT PRIMARY!
.../usr/local/bin/monitorstorage.pl:4014: EXEC: /usr/local/bin/becomestorage.pl grab
.../usr/local/bin/becomestorage.pl:9847: EXEC: /usr/local/bin/getstoragestatus.pl cluster1p reboot &
.../usr/local/bin/becomestorage.pl:9847: RET: /usr/local/bin/getstoragestatus.pl cluster1p reboot & 0
.../usr/local/bin/becomestorage.pl:9847: EXEC: /usr/local/bin/becomestorage.pl start primary
.../usr/local/bin/becomestorage.pl:9851: role is initially set to primary
.../usr/local/bin/becomestorage.pl:9851: taking cluster \
  IP: /etc/ha.d/resource.d/IPaddr 192.168.1.230 start >/dev/null 2>&1
.../usr/local/bin/storaged.pl:9916: writing my wait state to /tmp/storaged.waitstatus as "waiting"
After this point, you will notice that the log entries look just like those of the primary, because now the secondary has assumed the role of primary.

To explain this behavior: TKCluster uses a "preferred storage" and "init dead" design, so the cluster does not necessarily come up in the same state on every boot. When booted, the machine that is not set as the "preferred" machine will wait for a period of time known as "init dead". If the "init dead" time passes and the machine set as "preferred" has not yet come up, the secondary will assume the worst and take control of the storage cluster. If the machine set as "preferred" starts up after this "init dead" time expires, it will assume the role of secondary, despite having been preferred.

If the "preferred" machine starts up before the expiration of the "init dead" time, it will automatically assume its role as the primary, and the non-preferred machine will immediately fall into secondary mode. This allows for "hard" failures of the primary, such as purposely downing it for hardware upgrades, etc., and accounting for the possibility of a power failure during the time while the "preferred" is down.

If the non-preferred machine were always to assume a role of secondary on boot, it would never become primary without being told to do so. So, if the primary machine goes down and stays down, the secondary will take over. If a power outage or other reboot of the remaining "live" machine occurs, no more than the "init dead" time period will pass before it assumes control of the storage. During the "init dead" period, manual intervention can force it to skip the "init dead" delay and go immediately into primary mode.

Conclusion

Affordable, small-scale, high-availability clusters can be built using TKCluster. Some scripting may be necessary to customize TKCluster to your own needs, but the cluster framework is stable and reliable.

Resources

TKCluster Home -- http://www.SolidRockTechnologies.com/clustermanager/

Heartbeat -- http://www.linux-ha.org/download/

openMosix -- http://www.openmosix.org/

drbd -- http://www.linbit.com/ (free registration required)

libata kernel patches -- http://www.kernel.org/pub/linux/kernel/people/jgarzik/libata/

Tom Kunz holds a degree in Mechanical Engineering and lives with his amazing wife and children in the Pocono region of northeastern Pennsylvania. He has been involved in Linux and Unix systems administration and programming since about 1995. Tom has recently opened a Web/mail hosting and custom software business and his Web site is http://www.SolidRockTechnologies.com/. He can be reached via email at: tkunz@SolidRockTechnologies.com.