TKCluster
Tom Kunz
Linux has shown a lot of growth in the area of data-centric, high-availability
clustering. Most admins are already familiar with computational
clusters, known loosely as Beowulf clusters, which are implemented
in the form of MPI, PVM, LAM, MOSIX, and other process-sharing and
process-distributing technologies. There are also "Web service clusters",
such as those distributed in years past by TurboLinux and others.
These were typically groups of similarly configured servers that
used DNS and round-robin IP address tricks to give the illusion
of Web server high-availability to end users.
Cohesive operation between the nodes, however, was still achieved
only through a shared-storage medium, such as Fibre Channel or shared
SCSI, or through proprietary cluster hardware and software -- all
of which are prohibitively expensive for small businesses. A database
engine that serves a Web cluster must itself be clustered to achieve
true high availability. Application-level high-availability tools
(such as the MySQL database engine) that transparently replicate
data between servers are also being used to provide some level of
redundancy.
The one area in which Linux is still starved for attention is
the realm of lightweight, easily configured, affordable high availability
-- a general-purpose cluster. A general-purpose, high-availability
cluster must be "application agnostic" -- it should not care what
runs on it, whether that is a Web server, mail server, database server,
or any future, yet-unknown type of service. The cluster should behave
uniformly no matter what application is running.
In response to this, I have written TKCluster (when I initially
wrote it, I couldn't think of a good name for it, so I just prefixed
"cluster" with my initials). TKCluster is available under the GPL
so that anyone can freely download and modify it to suit their needs.
Overview
TKCluster itself is a cluster manager. Raw data replication between
nodes is performed by the wonderful DRBD driver by Philipp Reisner.
DRBD is a block device that maps to a given raw disk partition and
a socket. Writes to the DRBD device (/dev/nb0 .. /dev/nbX) go to
both the physical disk in the local machine, as well as to the waiting
secondary node over a standard Ethernet connection. All clusters
require some kind of "heartbeat" mechanism. After experimenting
with various ones, I chose openMosix. openMosix was designed to
share computation-intensive process loads between multiple machines;
however, I have yet to find anything that does as good a job at
maintaining a frequently updated list of connected machines.
The process-load sharing and MFS filesystem (analogous to traditional
NFS, but infinitely smarter) make openMosix a perfect candidate
for helping to tie the cluster nodes together. Although MFS is not
necessary for TKCluster operation, it sure helps when copying configuration
files around between the nodes of the cluster. TKCluster's role
is to use the data gleaned from both openMosix and DRBD to make
decisions about starting services, seizing control of the cluster
IP address, and keeping the sister copy of TKCluster on the other
node of the cluster aware of what's going on.
Design
TKCluster is intended for use in clusters where data is to be
shared between two nodes, a primary and a secondary. The data that
is "shared" between them lives on one or more partitions, each partition
having its own DRBD device. These partitions should be separate
from the system partitions (/, /boot, /usr, /var, etc.), because
the secondary node will have no access to these partitions until
the primary dies (and the secondary seizes control) or the primary
gives up control of the partition. The secondary simply waits for
something to happen and then acts accordingly to take control of
the partition and restart the services previously served by the
primary.
These DRBD devices are "raw" devices; they need not have a filesystem
on them. Database engines such as Oracle or Informix can be configured
to use "raw" partitions. Because DRBD is a block device driver,
it simply passes raw block writes through to the local disk. The
application running on top of it is unconcerned with the underlying
device, as long as requests are satisfied and the driver behaves
as a block device driver should. A DRBD device can also contain
a traditional filesystem such as ext3, and be mounted in the usual
fashion. DRBD thus keeps the data sharing thoroughly agnostic of
the applications that talk to it.
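For example, once a DRBD device has been configured and the local
node holds the primary role (the drbdsetup commands are shown later
in this article), putting ext3 on it works just as it would on any
other block device. A minimal sketch, with a mount point chosen purely
for illustration:

# mke2fs -j /dev/nb0
# mkdir -p /mnt/shared
# mount /dev/nb0 /mnt/shared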
When the cluster is fully configured and initially powered
up, the primary will talk to the secondary and push its data over
as needed. DRBD can perform a full synchronization, meaning a direct,
block-for-block copy of the entire partition, or a "fast" synchronization,
where it copies over only the changes it finds. For the most part,
a cold boot of the cluster will result in a full sync, while a momentary
loss of connectivity between primary and secondary may result in
a "fast" sync.
Once the sync is done, the secondary's partition will have all
the data of the primary's. All future writes on the primary will
flow over to the secondary and be committed there as well.
While the cluster is running, openMosix is active on both machines
and maintains its own list of which machines are currently up. The process-sharing
capabilities of openMosix may be used, although most servers do
not cause excessive computational loads. openMosix judiciously migrates
processes that are strictly computational, and generally keeps I/O-bound
processes local to the machine on which they were started. The main
purpose of openMosix will be to actively collect availability data
so that the secondary node of the cluster can accurately and reliably
detect when a primary node dies, and then grab the services of the
primary accordingly.
Because DRBD is intended for duplication of data between only
two nodes, the data high availability of TKCluster cannot scale
beyond two nodes. However, thanks to openMosix, if TKCluster
is to be used to service a computational cluster, additional "openMosix-only"
nodes can be added to the cluster. These nodes would not be running
DRBD, and would only participate in the computationally intensive
processes in the cluster. The computational horsepower of the cluster
can be multiplied as needed by adding more and more "openMosix-only"
nodes to the cluster, although this is generally only needed in
scientific and research-oriented endeavors.
Hardware
TKCluster is intended to enable small- to medium-sized businesses
to gain the benefits of data high availability without paying large
sums for commercial clustering hardware and software. As such, this
article will highlight the installation and usage of TKCluster as
it is currently installed at one of my customer's sites. The hardware
used in this example is not the cheapest of the cheap, nor is it
the most expensive. It is all fairly standard, commonly available
PC hardware.
If you choose to duplicate this particular installation today,
depending on your hardware vendor, the initial hardware outlay will
be less than $2000 -- most likely substantially less. Even lower-powered
machines are quite usable with TKCluster; my customer specifically
chose this configuration because of previous experience with the
same hardware. I am a small business owner, and I have designed
TKCluster so that others in a similar situation can use it as well.
Two machines, identically configured in hardware and BIOS, are
preferred to make life easy on the administrator. It's possible
to have very different hardware and chipsets between the two nodes
of the cluster; however, I strongly recommend identical hardware
configurations. This vastly simplifies the administrative overhead
if anything is ever changed inside the servers. The hardware selected
for each machine in this configuration consists of the following:
- Asus P4P800 motherboard (includes 3Com 3c940 gigabit LAN and
Intel 865G/ICH5 chipset)
- 512MB Crucial PC-3200 RAM (2 x 256M DIMMs)
- Western Digital 36GB SATA "Raptor" HD
- Intel 2.6 GHz CPU, 800 MHz FSB
- Intel Pro/1000 MT gigabit NIC
- A generic case, power supply, video card, CD-ROM, and floppy
Hardware gurus familiar with Intel products will immediately notice
that the motherboard and NIC are not traditionally considered "server-class",
because they lack the faster PCI-X architecture. However, in this
configuration, which specifically aims to be affordable for the
small business owner, it will become apparent why a PCI-X motherboard
and the PCI-X version of the Intel Pro/1000 gigabit card are not
necessary.
The limiting factor of this particular configuration is not necessarily
the PCI bus. Major performance benefits will not be realized by
simply adding PCI-X to the configuration. If you bring up one of
the two nodes of the cluster with Fedora Core 1.0, the libata drivers
from Fedora will address the SATA hard drive as /dev/sda. Once installed
and running, you can run hdparm -Tt /dev/sda and get something
like the following:
# hdparm -Tt /dev/sda
/dev/sda:
Timing buffer-cache reads: 3076 MB in 2.00 seconds = 1538.00 MB/sec
Timing buffered disk reads: 160 MB in 3.03 seconds = 52.81 MB/sec
Repeated executions of hdparm will average about 1500-1700
MB/sec cached and 50-53 MB/sec uncached. The maximum sustained throughput
of this particular disk is about 53 MB/sec. In the grand scheme of
things, virtually all hard drives, whether ATA, SATA, or SCSI, have
a maximum sustained throughput of between 40 and 60 MB/sec, so this
particular disk is in the upper-middle range of performance. There
are other articles that analyze the performance throughput of various
configurations, but without going up to a RAID configuration using
much more expensive controller cards and multiple disks, the best
single-disk throughput realizable with today's hard-drive technology
is a sustained 60 MB/sec or so. For small business owners, where every
penny counts, cost is a compelling reason to stay with a single disk
on the built-in controller and not opt for a more expensive RAID controller
connected to several disks.
Some users may see the glossy advertisements of "320 MB/sec SCSI"
and start thinking their systems simply must have it. However, remember
that the advertised 320 MB/sec is not the sustained throughput;
it is only the speed at which data moves from the host controller
SCSI card to the disk's electronics and into the cache on the disk
itself. SATA disk caches are typically 8MB, and only the most expensive
SCSI disks have 16MB of cache onboard. The actual cache-to-platter
speed is the important speed; it is that speed that limits sustained
throughput to the disk.
While this particular disk is physically limited to about 53 MB/sec,
the theoretical limit of a gigabit NIC, connected to the fastest
possible PCI slot, is 125 MB/sec (1000 Mbits/sec divided by 8 bits
per byte). The NIC in this configuration is not the limiting factor
for duplicating data between the nodes of the cluster; the disk is.
Even on a plain PCI bus (not PCI-X), the practical limits of the
NIC are still a bit faster than those of the hard drive itself.
The choice of a cheaper motherboard and no RAID controller sounds
logical on paper -- the limiting factor should be the hard drive
itself. Increasing disk performance would be a major expense, one
that could keep small businesses from buying into a pair of machines
at all. Some are still likely to be skeptical, believing that a
"server-class" motherboard would boost performance substantially.
If you're skeptical, that's fine, but I'll cut to the chase -- cluster
performance for data replication between the two cluster nodes is
within a few percent of the disk's maximum throughput.
With a maximum sustained disk throughput of 50-53 MB/s, this cluster
configuration showed a consistent 47 to 50 MB/sec replication speed.
Factoring in IP latency, block-copy overhead, and data transfer
across the PCI bus from disk to NIC, a few MB/sec performance hit
isn't so bad after all. In this example, the customer's 18GB DRBD
partition replicated itself between the two servers in about 7 minutes.
The only way to really raise that limit would be a more expensive
RAID card (not just a motherboard with one of the onboard RAID chipsets)
connected to multiple disks. Since a good 3Ware card and several
disks would cost about as much as the rest of the machine, and a
new motherboard would cost substantially more than the Asus P4P800,
that route is likely to be cost-prohibitive for small business owners.
For enterprise installations with dozens or hundreds of users, however,
the cost may be well justified.
I have had the opportunity to work with several flavors of gigabit
NIC, but I keep coming back to the Intel Pro/1000 MT for both reliability
and economy. It's not that I'm an "Intel bigot" -- I use DLink,
NetGear, Realtek, and 3Com cards in a lot of machines, but not as
the NIC on which all the cluster data rides. If you have found a
particular flavor of gigabit card that you trust, one that has seen
a lot of long-term sustained throughput without any hiccups, this
cluster configuration will still work well for you.
Software
This cluster relies on several pieces of GPL software to function
properly. The full list can be found in the Resources section, but
the software is summarized here:
- TKCluster -- http://www.solidrocktechnologies.com/clustermanager
- openMosix -- http://www.openmosix.org
- DRBD -- http://www.linbit.com
- Heartbeat -- http://www.linux-ha.org
- Various kernel patches
The current version of TKCluster was developed against the above
packages, using Fedora Core 1.0 as the base distribution. TKCluster
consists of Perl and shell scripts, so anyone familiar with those
should be able to edit and alter it as necessary.
Kernel Issues
ICH5/ICH5R Serial ATA
The Intel ICH5/ICH5R offers a substantial improvement over previous
ICH chipsets. Although Linux support for the ICH5 SATA features is
present in later 2.4 kernels, the canonical kernel driver has some
alleged instabilities (I say "alleged" only because I have not fully
explored it myself) as well as performance weaknesses. Jeff Garzik
has written a nice driver for the ICH5 SATA as well as several VIA
and SiI chipsets. It comes as a set of kernel patches found at kernel.org:
http://www.kernel.org/pub/linux/kernel/people/jgarzik/libata/
Unfortunately, the kernel patches don't seem to be developed against
the canonical kernel. Some of the diffs in the file 2.4.22-libata1.patch
did not apply cleanly against 2.4.22. However, this is not a showstopper
by any means; you can manually patch the files that fail.
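If you want to see the damage before committing to anything, patch(1)
can do a dry run first; rejected hunks land in ".rej" files that you
can then apply by hand. A typical session, assuming the canonical
2.4.22 tree is unpacked in /usr/src/linux-2.4.22:

# cd /usr/src/linux-2.4.22
# patch -p1 --dry-run < ../2.4.22-libata1.patch
# patch -p1 < ../2.4.22-libata1.patch
# find . -name '*.rej'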
sk98lin Drivers
A fair amount of "fiddling" had to be done to get the kernel
patched. I was impressed by the number of "extra" drivers for gigabit
adapters in the Fedora-distributed kernel, so I attempted to develop
patches against the Fedora-distributed 2.4.22 kernel source. However,
because that source appears to use the new scheduler code, the openMosix
patches did not apply cleanly to it. In fact, the patches failed
miserably against the scheduler code and resulted in unusable source.
Between that and the difficulties encountered with the libata patches,
I was disappointed not to be able to develop a concise patch for
Fedora's 2.4.22 source.
openMosix
I ended up pulling canonical kernel source for 2.4.22 straight
from kernel.org, and then patching with openMosix. Once the openMosix
patches were applied, I copied the gigabit drivers (notably the
sk98lin driver for the 3Com 3c940 adapter built into the P4P800
motherboard) directly from the Fedora source, as well as the files
for the libata source. I went through the libata patch file and
copied all relevant code into the proper directories. I believe
you will also find the latest e1000 driver patches in there, which
work with a larger variety of Intel Pro/1000 variants and board
revisions than the canonical drivers. If you find that your Pro/1000
is not automatically detected by the driver, try downloading the
latest version of the driver directly from Intel and compiling it.
New board revisions occur all the time, and the canonical kernel
drivers become outdated quickly.
If you are a bit squeamish about this level of patching, I
have provided a reference page containing links to the kernel source
that I patched. The kernel tarball available on that page contains
only the canonical kernel source plus the aforementioned drivers
that I copied in from the Fedora source. The ata_piix driver from
Jeff Garzik is quite nice and offers tremendous performance. If
you are using mod_scsi, and especially if you are using the driver
to talk to a CD/DVD burner, pay special attention to that driver.
If you accidentally load the "old" driver instead of the new one,
your machine is likely to hang on boot when the module is inserted.
Setup and Applications
The easiest way to get the cluster up and running is the following:
1. Assemble all hardware, but put both SATA hard drives into
one of the machines. Once the first drive is fully configured, it
will be duplicated onto the second.
2. Install Fedora Core 1.0. (Fedora Core 2 is not available in
non-beta form at the time of this writing, although it probably
will be by the time this goes to press.)
Make sure you get Perl installed because TKCluster is written
mostly in Perl. During installation, set aside at least one large
partition that will be used as the "cluster" partition. I set up
this particular customer with the following disk layout:
Disk /dev/sda: 37.0 GB, 37019566080 bytes
64 heads, 32 sectors/track, 35304 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes

   Device Boot    Start      End    Blocks  Id  System
/dev/sda1   *         1       96     98288  83  Linux
/dev/sda2            97    35304  36052992   5  Extended
/dev/sda5            97     1074   1001456  82  Linux swap
/dev/sda6          1075     2052   1001456  83  Linux
/dev/sda7          2053     5959   4000752  83  Linux
/dev/sda8          5960     7913   2000880  83  Linux
/dev/sda9          7914    17680  10001392  83  Linux
/dev/sda10        17681    35304  18046960  83  Linux
To give you an idea of the layout, here is a snippet from "mount":
/dev/sda6 on / type ext3 (rw)
/dev/sda1 on /boot type ext3 (rw)
/dev/sda9 on /home type ext3 (rw)
/dev/sda7 on /usr type ext3 (rw)
/dev/sda8 on /var type ext3 (rw)
Note that /dev/sda10 is specifically left out. This will be the cluster
partition. If you are using a different disk layout, set aside at
least one partition where you will be placing the shared data area.
3. Run "up2date" and make sure everything has all the necessary
security patches.
4. Download and install the kernel source I have provided on
the TKCluster home page, or patch your own kernel as desired if
you are using different hardware.
If something goes wrong, do not replace your existing kernel
from Fedora; just configure /etc/grub.conf so that you can select
the new kernel from the menu and boot it. Once the kernel boots
fine and you are comfortable with it, modify /etc/grub.conf so that
it boots the new kernel without your selecting it. Make sure you
also grab the openMosix user-land tools from their Web site and
install them at the same time.
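As a sketch, the added /etc/grub.conf entry might look like the
following. The kernel image name here is only an example; the root
device matches the disk layout shown in step 2:

title Custom 2.4.22-openMosix kernel
        root (hd0,0)
        kernel /vmlinuz-2.4.22-openmosix ro root=/dev/sda6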
5. If the Fedora install does not automatically detect both the
3c940 and the Intel Pro/1000, you will have to manually adjust
/etc/modules.conf. Insert the following two lines into /etc/modules.conf,
and comment out any other references to "eth0" and "eth1" elsewhere
in the file:
alias eth0 sk98lin
alias eth1 e1000
Note that the Intel Pro/1000 MT uses the "e1000" driver, and that
this is specifically set to be "eth1".
Also, make sure that /etc/hosts is set up properly. The cluster
relies on /etc/hosts mapping the right names to the right addresses
so that messages go out over the right interface:
127.0.0.1 localhost.localdomain localhost
192.168.1.201 cluster1 cluster1.tntreloading.com
192.168.1.202 cluster2 cluster2.tntreloading.com
192.168.1.230 cluster cluster.tntreloading.com
10.0.0.1 cluster1p cluster1p.tntreloading.com
10.0.0.2 cluster2p cluster2p.tntreloading.com
Note the use of "192.168.1.230". This is the IP alias that "floats"
between the two cluster nodes. Whichever cluster node is currently
the primary owns that IP address as an alias. When the primary goes
down, the secondary grabs that address by using the "IPaddr" script
from the heartbeat package. Install the "heartbeat" package at this
point if you have not already done so.
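If you want to verify the floating alias by hand before the cluster
ever manages it, the same IPaddr script that TKCluster calls can
be run directly, then stopped again:

# /etc/ha.d/resource.d/IPaddr 192.168.1.230 start
# /etc/ha.d/resource.d/IPaddr 192.168.1.230 stop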
6. If all has gone well, you now have a bootable system that
sees its disk properly as /dev/sda, talks to both NICs, and has
an entry for /proc/hpc (the openMosix area under /proc); you are
ready to configure the specific features of openMosix. The Intel
Pro/1000 card will be used as the cluster NIC, over which openMosix
and DRBD talk to each other. The 3c940 will be used for talking
to the rest of the LAN.
The rest of this document will assume that the Intel card on
this machine is using the address 10.0.0.1 netmask 255.255.255.0,
and the 3c940 is using 192.168.1.201 netmask 255.255.255.0. When
the secondary machine is brought up, it will be configured similarly,
with the Intel card as eth1 on 10.0.0.2/255.255.255.0 and the 3c940
as eth0 on 192.168.1.202/255.255.255.0. The Intel cards in each
machine will be connected directly to one another via an MDI/X "crossover"
UTP cable.
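On Fedora, these addresses live in the ifcfg files under
/etc/sysconfig/network-scripts/. On the primary, ifcfg-eth1 would
look something like the following (ifcfg-eth0 is analogous, using
the 192.168.1.201 address):

DEVICE=eth1
BOOTPROTO=static
IPADDR=10.0.0.1
NETMASK=255.255.255.0
ONBOOT=yes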
7. Make sure /etc/openmosix/openmosix.config has the following
parameters set:
AUTODISCIF=eth1
MYOMID=1
MFS=yes
8. Create /etc/openmosix.map with the following entries, which tell
openMosix to set up 10.0.0.1 as node #1, and 10.0.0.2 as node #2:
1 10.0.0.1 1
2 10.0.0.2 1
9. Download and install DRBD from Linbit. It should be a very quick
and easy install. The default configuration is for two DRBD devices;
however, you can create up to 255 DRBD devices by altering drbd/drbd_main.c
and changing this line:
int minor_count=2;
to whatever number of DRBD devices you want.
10. Alter your iptables configuration to allow DRBD's port
7788/tcp through, and openMosix ports of 5000-5700/udp, 723/tcp,
4660/tcp, and 5428/udp. If you encounter trouble with the firewall,
please consult the openMosix FAQ to make sure you have the most
recent information on port numbers it uses.
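One way to open those ports, assuming eth1 is the dedicated cluster
link and you are adding rules to a default Fedora iptables setup,
is with rules like these:

# iptables -A INPUT -i eth1 -p tcp --dport 7788 -j ACCEPT
# iptables -A INPUT -i eth1 -p udp --dport 5000:5700 -j ACCEPT
# iptables -A INPUT -i eth1 -p tcp --dport 723 -j ACCEPT
# iptables -A INPUT -i eth1 -p tcp --dport 4660 -j ACCEPT
# iptables -A INPUT -i eth1 -p udp --dport 5428 -j ACCEPT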
11. If a reboot still results in a usable machine, duplicate the
first disk onto the second (which will go into the other machine)
with dd:
dd if=/dev/sda of=/dev/sdb bs=128M
12. When dd is finished, shut down and move /dev/sdb into the other
machine. Boot it up, and change its IP addresses to the previously
mentioned addresses. Also make sure to alter /etc/openmosix/openmosix.config
and set "MYOMID=2" so that openMosix knows it is running on the secondary
node.
13. Now, when both machines boot, you will have an openMosix
cluster that is capable of sharing process load between them. If
you run "mosmon", you should see both "1" and "2" along the bottom
edge of the display, meaning that both nodes are up and visible
to openMosix.
14. If you have an identical disk layout to the one shown here,
format the /dev/sda10 partition on the primary machine with the
ext3 filesystem. If you are not using an identical layout, locate
the free space you set aside on your disk during install and format
it.
Once formatted, I suggest mounting it to /opt. (Make sure /opt
is NOT listed in /etc/fstab; you do not want the OS to have any
direct control over mounting, fsck'ing, etc., on that partition.
TKCluster must be the only thing controlling access to it.)
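With the example layout, formatting and the initial mount boil
down to something like the following (the mount here is temporary,
just for loading data; TKCluster will control the partition from
then on):

# mke2fs -j /dev/sda10
# mount /dev/sda10 /opt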
Then, place some data in it, such as a PostgreSQL data directory,
a MySQL data directory, or some Web home directories; you might
also share /opt via NFS or Samba. Configure the related servers
or their init scripts to point to that directory structure as necessary.
You will need to write your own init script for /dev/nb0, modeled
after the sample "runme" in the TKCluster package, so that it starts
and stops all relevant servers using their own default init scripts;
a sketch follows below.
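As a rough sketch only (the service names here are examples, and
the authoritative reference is the "runme" script shipped with
TKCluster), such a start/stop wrapper could look like this:

#!/bin/sh
# Example wrapper for the services living on /dev/nb0.
# "start" brings the services up after TKCluster has made the
# partition available; "stop" shuts them down before it is released.
case "$1" in
  start)
    /etc/init.d/mysqld start
    /etc/init.d/httpd start
    ;;
  stop)
    /etc/init.d/httpd stop
    /etc/init.d/mysqld stop
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    exit 1
    ;;
esac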
Because TKCluster will have control over these services and
not the OS itself, remove the symlinks in /etc/rc3.d and /etc/rc5.d
to those services so that they do not start or stop prematurely.
TKCluster will be in charge of making the partition available to
the services and then starting the services when ready. Make sure
all modifications made outside of /opt are done on both machines.
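On Fedora, chkconfig will remove those symlinks for you; for example,
for the services named in the sketch above:

# chkconfig --level 35 mysqld off
# chkconfig --level 35 httpd off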
15. Set up TKCluster on both machines. The default configuration
file in the package, "cluster.conf", contains a setup compatible
with this example. If you use a configuration identical or very
similar to this one, the only parameter you will need to edit is
the "PRIMARYSERVICES" line that references /dev/nb0. Point it at
a script that accepts "start" and "stop" parameters on the command
line, and that accordingly starts and stops the services that will
be accessing the data on "/dev/nb0". Each partition you share between
the cluster nodes will have its own start/stop script.
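As a purely hypothetical illustration of that edit (the exact syntax
here is my guess; consult the comments in the bundled cluster.conf
for the authoritative form):

# Hypothetical syntax -- see the bundled cluster.conf:
PRIMARYSERVICES=/etc/init.d/nb0-services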
Contained in the package is the script "installme". Once you
have edited cluster.conf appropriately and created your start/stop
script, just run "./installme" and it will copy the files to the
right directories. If you have a version of Perl other than 5.8.3,
or if you have installed Perl somewhere other than /usr, you will
need to make some slight changes to the script.
16. Test DRBD to make sure the firewall and DRBD are working
properly and that data replication can take place properly. On the
primary node:
# modprobe drbd
# drbdsetup /dev/nb0 disk /dev/sda10
# drbdsetup /dev/nb0 net 10.0.0.1:7788 10.0.0.2:7788 C -s 16384 --sync-nice -19
# drbdsetup /dev/nb0 primary
and then on the secondary node:
# modprobe drbd
# drbdsetup /dev/nb0 disk /dev/sda10
# drbdsetup /dev/nb0 net 10.0.0.2:7788 10.0.0.1:7788 C -s 16384 --sync-nice -19
You will now see occasional blinks of the hard drive activity lights
in unison on both machines. The default sync rate is only 250KB/sec.
To change that to a higher throughput, do this on the primary node:
# drbdsetup /dev/nb0 syncer --max 600000
# drbdsetup /dev/nb0 syncer --min 599999
The hard drive lights on each system should now stay "on" almost continuously.
The raw data on the primary system's /dev/sda10 is flowing over the
gigabit link to the secondary. To see what kind of rate you are getting,
take a look at DRBD's /proc entry:
# cat /proc/drbd
version: 0.6.12 (api:64/proto:62)
0: cs:SyncingAll st:Primary/Secondary ns:474328 nr:18054652 dw:18061700 dr:482233 pe:840 ua:0
[>...................] sync'ed: 2.7% (17160/17623)M
finish: 0:07:10h speed: 47,441 (46,201) K/sec
1: cs:Unconfigured st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
The line "finish: 0:07:10h..." tells all we need to know about the
speed of synchronization. The first number "47,441" is the instantaneous
rate of the current transfer in KB/sec. The number in parenthesis
"(46,201)" is the average over the time the sync has been running.
In this case, the sync'er setting had already been set to a high value
before initiating the sync at 250 KB/sec, so the average is very close
to the instantaneous. This is generally the case in a failover situation
when the secondary takes over the role of primary.
17. We have now verified that both openMosix and DRBD are working
as desired. At this point, DRBD can be stopped and the drbd.o module
unloaded on both machines, as TKCluster does this job as part of
its startup. After that, TKCluster can be started on the primary
machine by running its init script:
# /etc/init.d/storagecluster start
You can watch the progress of TKCluster in /var/log/cluster.log. If
everything is set up right and your services are starting properly,
you will see something like the following:
.../usr/local/bin/checkinitdead.pl:3992: Starting /usr/local/bin/checkinitdead.pl
.../usr/local/bin/checkinitdead.pl:3992: I am cluster1p, cluster1p is preferred storage master
.../usr/local/bin/checkinitdead.pl:3992: returning immediately to become master
.../usr/local/bin/becomestorage.pl:3997: role is initially set to
.../usr/local/bin/becomestorage.pl:3997: "2" is up
.../usr/local/bin/storaged.pl:3990: /usr/local/bin/storaged.pl starting
.../usr/local/bin/storaged.pl:3990: writing my wait state to /tmp/storaged.waitstatus as "waiting"
.../usr/local/bin/storaged.pl:3990: server started on port 2345
.../usr/local/bin/becomestorage.pl:3997: 10.0.0.2 waitstatus is nowait (returned )
.../usr/local/bin/becomestorage.pl:3997: "10.0.0.2" is not waiting for me, so I'll become primary
.../usr/local/bin/storaged.pl:4011: writing my wait state to /tmp/storaged.waitstatus as "waiting"
.../usr/local/bin/becomestorage.pl:3997: taking cluster IP: /etc/ha.d/resource.d/IPaddr \
192.168.1.230 start >/dev/null 2>&1
.../usr/local/bin/storaged.pl:4073: returning getwait() with "waiting"
.../usr/local/bin/storaged.pl:4081: writing my wait state to /tmp/storaged.waitstatus as "waiting"
.../usr/local/bin/becomestorage.pl:3997: EXEC: /sbin/e2fsck -p /dev/sda10
.../usr/local/bin/becomestorage.pl:3997: RET: /sbin/e2fsck -p /dev/sda10 0
Each TKCluster process reports date and time (replaced by "..." here),
the full pathname, and its PID on each line of the logfile. In many
cases where TKCluster is calling an external program, the word "EXEC:"
prefixes the command. When the command returns, the message includes
"RET:", followed by the command, with the exit status appended to
the end (note the "0" on the last line of the above snippet). After
this, you would see various "EXEC" and "RET" pairs for loading drbd,
configuring the drbd device, and so on.
Now that the primary machine is up, you can run the same init
script on the secondary and watch /var/log/cluster.log on the secondary
machine as the corresponding processes start there.
Testing
TKCluster is relatively easy to test. While both machines are
up and after the initial drbd sync is done, hit the reset button
on the secondary while watching /var/log/cluster.log on the primary.
You should see no messages appear in /var/log/cluster.log until
the secondary machine starts coming back up. When that happens,
you will see:
.../usr/local/bin/storaged.pl:15130: returning getwait() with "waiting"
This says that the secondary attempted to contact the primary to inform
it that it was coming up. The next few messages will be something
like:
.../usr/local/bin/monitorstorage.pl:1471: Status/Role Change from primary to primary syncing
.../usr/local/bin/monitorstorage.pl:1471: Status/Role Change from primary syncing to primary
TKCluster recognizes the role changes and ensures that DRBD is in
a state where it will accept connections from the remote side for
resynchronization. It is during this that a "fast" sync will occur.
Once the drbd sync is done, hit the reset switch on the primary
while watching /var/log/cluster.log on the secondary. You should
see something like this:
.../usr/local/bin/monitorstorage.pl:4029: MOSIX not alive on node 1 (cluster1p)
This message will be repeated by one or more "monitorstorage.pl" processes
until the number of retries configured in cluster.conf has been exhausted;
then "becomestorage.pl" is invoked with the "grab" parameter, which
means to forcibly attempt to reboot the primary node and grab all
of its services:
.../usr/local/bin/monitorstorage.pl:4029: MOSIX not alive on node 1 (cluster1p)
.../usr/local/bin/monitorstorage.pl:4014: getting out of retry loop
.../usr/local/bin/monitorstorage.pl:4014: Falling out of monitorstorage.pl. I AM NOT PRIMARY!
.../usr/local/bin/monitorstorage.pl:4014: EXEC: /usr/local/bin/becomestorage.pl grab
.../usr/local/bin/becomestorage.pl:9847: EXEC: /usr/local/bin/getstoragestatus.pl cluster1p reboot &
.../usr/local/bin/becomestorage.pl:9847: RET: /usr/local/bin/getstoragestatus.pl cluster1p reboot & 0
.../usr/local/bin/becomestorage.pl:9847: EXEC: /usr/local/bin/becomestorage.pl start primary
.../usr/local/bin/becomestorage.pl:9851: role is initially set to primary
.../usr/local/bin/becomestorage.pl:9851: taking cluster \
IP: /etc/ha.d/resource.d/IPaddr 192.168.1.230 start >/dev/null 2>&1
.../usr/local/bin/storaged.pl:9916: writing my wait state to /tmp/storaged.waitstatus as "waiting"
After this point, you will notice that the log entries look just like
those of the primary, because now the secondary has assumed the role
of primary.
To explain: TKCluster uses a "preferred storage" and "init
dead" design, in which the cluster does not necessarily assume the
same state every time it boots. When booted, the machine that is
not set as the "preferred" machine waits for a period of time known
as "init dead". If the "init dead" time passes and the machine set
as "preferred" has not yet come up, the secondary assumes the worst
and takes control of the storage cluster. If the machine set as
"preferred" starts up after the "init dead" time expires, it assumes
the role of secondary, despite having been preferred.
If the "preferred" machine starts up before the expiration
of the "init dead" time, it will automatically assume its role as
the primary, and the non-preferred machine will immediately fall
into secondary mode. This allows for "hard" failures of the primary,
such as purposely downing it for hardware upgrades, etc., and accounting
for the possibility of a power failure during the time while the
"preferred" is down.
If the non-preferred machine were always to assume a role of
secondary on boot, it would never become primary without being told
to do so. So, if the primary machine goes down and stays down, the
secondary will take over. If a power outage or other reboot of the
remaining "live" machine occurs, no more than the "init dead" time
period will pass before it assumes control of the storage. During
the "init dead" period, manual intervention can force it to skip
the "init dead" delay and go immediately into primary mode.
Conclusion
Affordable, small-scale, high-availability clusters can be
built using TKCluster. Some scripting may be necessary to customize
TKCluster to your own needs, but the cluster framework is stable
and reliable.
Resources
TKCluster Home -- http://www.SolidRockTechnologies.com/clustermanager/
Heartbeat -- http://www.linux-ha.org/download/
openMosix -- http://www.openmosix.org/
drbd -- http://www.linbit.com/ (free registration required)
libata kernel patches -- http://www.kernel.org/pub/linux/kernel/people/jgarzik/libata/
Tom Kunz holds a degree in Mechanical Engineering and lives
with his amazing wife and children in the Pocono region of northeastern
Pennsylvania. He has been involved in Linux and Unix systems administration
and programming since about 1995. Tom has recently opened a Web/mail
hosting and custom software business and his Web site is http://www.SolidRockTechnologies.com/.
He can be reached via email at: tkunz@SolidRockTechnologies.com. |