Linux
High Availability Clusters with Heartbeat, DRBD, and DRBDLinks
Sean Reifschneider
Heartbeat and DRBD allow High Availability (HA) clusters to be
built on Linux very inexpensively. In the past, HA
clusters typically required a standalone RAID array (preferably
Fibre Channel) in addition to the pair of servers. Now, for a fraction
of the cost of a standalone RAID array and using entirely free software,
an HA cluster can be built with Heartbeat and DRBD.
While HA clusters can be attractive because of their increased
resiliency, they definitely increase management overhead. Not only
do you have to update software on two machines, there's added complexity
associated with each system. HA systems must also be tested, preferably
regularly. As Alan Robertson, leader of the Heartbeat project, says,
"An HA system which isn't tested will surprise you someday."
One of the tools we've created in our work verifying and setting
up clusters for clients is meant to help address the management
complexities of HA clusters using DRBD and Heartbeat. In this article,
I will first provide an overview of HA clustering using DRBD and
Heartbeat, and then demonstrate the use of DRBDLinks.
HA Clustering Overview
High Availability (HA) clusters involve a pair of machines acting
as a single, more resilient machine. One machine is the primary
and the other stands by, monitoring the primary. In the event of
any sort of failure of the primary machine's hardware or software,
the secondary machine will take over the application. In most cases,
this is more like a fast reboot than a seamless, uninterrupted operation.
Note that this is different from "Compute Clusters", in which
many machines are used to work on parts of a very big problem. HA
clusters will not provide a performance increase over running on
a single machine.
The primary benefit of an HA cluster is that the downtime exposure
is limited. If a power supply fails on one machine at 3 a.m., the
services will be back up and running on the spare within seconds
or minutes without any human intervention. Also, you can do hardware
and software maintenance, even work that would otherwise involve extended
downtime, while suffering only a very brief unavailability. Depending on
your data and applications, this downtime may be a few seconds or
a few minutes.
Obviously, both machines must have access to the data required
to run the applications, as well as network connectivity and other
related resources. One common way of implementing shared data is
to set up an external RAID array to which both machines can be connected.
The machines then arbitrate their access to that resource to ensure
that only one is mounting the file system at any given time. Another,
less expensive option is to use DRBD, which I will describe in
detail later.
The applications in the cluster are brought up and down using
"start" and "stop" scripts. A typical start script would need to:
- Run "fsck" on the file system. In the case of a hardware failure
on the primary node, the file system would be left in an unclean
state. Obviously, a journaling file system is preferable so that
fsck takes seconds instead of (possibly tens of) minutes.
- Mount the shared file system.
- Set up any crontabs that are related to the application. Cron
jobs that require access to the shared data can't just be dropped
into the standard crontab. They must be set up to run only on
the active node.
- Start any applications that provide access to the shared data.
For example, Web servers, databases, email (POP, IMAP, SMTP) servers,
etc.
The "stop" script is usually just the inverse of the start script.
Heartbeat
Heartbeat is a program that runs on both machines and elects one
node as primary. Once up, the secondary machine monitors the primary
and, in the event of a failure, will promote itself to
primary. Heartbeat runs scripts on the primary node to bring the
applications up and shut them down. In that way, it's similar to
a multi-machine "init" process.
The machines in the cluster communicate with each other by sending
"heartbeat packets" about twice a second. Typically you would configure
the heartbeats to run over multiple different paths, for example,
via serial and network links. Missing a few seconds of these heartbeats
could trigger a failover, so you need to make sure that rebooting
a switch or unplugging a network cable will not interrupt all of the
heartbeat paths at once.
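As a sketch, the communication-path portion of Heartbeat's
"/etc/ha.d/ha.cf" might look something like this; the serial port,
interface, node names, and deadtime are example values only:

serial /dev/ttyS0   # heartbeat path 1: null-modem serial cable
bcast eth1          # heartbeat path 2: dedicated crossover network link
deadtime 10         # seconds of missed heartbeats before the peer is declared dead
node ha1
node ha2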
Heartbeat provides standard "start" and "stop" scripts for doing
many common tasks such as bringing up and down IP addresses or mounting
partitions. It's worth noting that the IP address scripts will send
out gratuitous ARP packets to alert other devices on the network
of the topology change. System start/stop scripts, typically found
in "/etc/init.d", can also be directly used with Heartbeat. For
more complex tasks, custom scripts can be called to start or stop
parts of the application.
Another part of Heartbeat is called STONITH, short for Shoot The
Other Node In The Head. In the case of a primary node failure, the
secondary node uses STONITH to ensure that the old primary is definitely
down before bringing up the services on the secondary node. STONITH
uses serial or network-accessible power switches to power-cycle
the other node, thereby ensuring that the cluster isn't running
as a "split brain".
The so-called "split brain" operation can happen if there is a
problem in configuration or the Heartbeat communication paths. A
split brain occurs when both machines in the cluster believe they
are primary and try to run the application. This often leads to
undesired results, such as multiple machines having the same IP
address or file systems on shared devices becoming corrupted.
DRBD
As mentioned in the introduction, one mechanism for sharing data
between two machines is to use an external RAID array. The primary
drawback to this is cost, with typical array configurations costing
no less than $2,000.
DRBD is a "Distributed Replicated Block Device" that allows similar
results to be achieved on local discs using a network connection
for replication. DRBD can be thought of as a RAID-1 (mirrored drives)
system that mirrors a local hard drive with a drive on another computer.
In fact, early clusters were set up using the Linux "md" RAID-1
and "nbd" network block device drivers to literally mirror a local
drive with one on a remote system. That solution didn't prove to
be very robust, however.
DRBD includes mechanisms for tracking which system has the most
recent data, "change logs" to allow a fast partial re-sync, and
startup scripts that reduce the likelihood that a system will come
up in "split brain" operation.
We typically set up a dedicated network using a direct cross-over
network connection between the machines. With Gigabit Ethernet,
the network has more bandwidth than a single typical IDE drive does.
We've synced 70 GB of data in about 20 minutes, or around 55 MB/sec.
However, in most cases, you don't need anywhere near this much bandwidth.
A 100-Mbps network, which can handle up to around 10 MB/second,
will work for most applications.
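For reference, a stripped-down "/etc/drbd.conf" describing one resource
replicated over such a crossover link might look roughly like this; the
host names, devices, partitions, and addresses are assumptions, and the
exact directives vary between DRBD versions:

resource drbd0 {
  protocol C;                  # synchronous replication
  on ha1 {
    device    /dev/drbd0;
    disk      /dev/sda3;       # local partition being mirrored
    address   10.1.1.1:7788;   # replication address on the crossover link
    meta-disk internal;
  }
  on ha2 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   10.1.1.2:7788;
    meta-disk internal;
  }
}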
Sample Applications
There are a number of applications where this sort of high-availability
cluster is typically used:
- NFS servers, especially with "email toasters" -- The NFS protocol
is truly stateless. A pair of NFS servers can be set up and, in
the event of a failure of the primary, the secondary comes up
in its place. In our experience, these failovers result in data
access on the clients hanging for a few seconds and then continuing
as if nothing had happened. One particular use of this is having
a number of inexpensive machines running POP, IMAP, and SMTP with
email data residing on the NFS server.
- Load balancers -- A cluster of email servers, as mentioned
above, would often have a load balancer in front of it that makes
the cluster of machines look like a single larger system. If this
load balancer goes down, all access to the systems behind the
load balancer is interrupted. For this reason, load balancers
are often set up as a pair of machines in an HA cluster.
- Web servers -- For a business that relies on its Web presence,
an HA cluster makes a lot of sense. While HA clusters are no replacement
for good monitoring, maintenance, and backup practices, they can
help ensure that your Web site remains available even in the event
of software and hardware problems.
Deciding to Implement Linux-HA
It's important to understand what is involved with setting up
an HA cluster before you decide to use one. There is much more involved
in it than simply installing and configuring a couple more packages.
In fact, it's quite a lot more complicated than maintaining two
standalone computers.
At the outset, you should realize that an HA cluster may increase
the downtime of your applications. Really. The added complexity
of an HA cluster can make it easier for operational mistakes to
cause outages. As Tandem Computers found in its studies
in the 1980s and 1990s, once its software and hardware became
more resilient, operational errors dominated the causes of outages.
One large cause of downtime in HA clusters is regular testing.
It's important to verify that changes you have made to the system
do not impact its ability to failover. It's equally important for
your operations staff to remain familiar with the commands and processes
used in the HA cluster and its failover. Expect to run failover
testing at least once a quarter and possibly monthly or more if
you are making changes to the systems. Never trust your business
to an untested HA cluster.
Why bother with an HA cluster if it may increase your downtime?
The downtime that a cluster causes is scheduled downtime, and therefore
tends to be much less disruptive than unplanned downtime. For scheduled
outages, you would usually have the resources on hand and ready
to resolve any problems in the event that the failover were to take
more than the usual few seconds or minutes.
In contrast, imagine that an unscheduled outage occurs -- a RAM
module fails at 3 a.m. on a Monday. It could take anywhere from 15 minutes
to several hours before the outage is noticed, and more time still
to get staff in place to diagnose and repair it (assuming replacement
parts are easily available). An HA cluster could have had the application
back up within a minute.
An HA cluster is, quite simply, risk management. It may lead to
more scheduled downtime, but can save hours or even days of downtime
in the event of a failure. An HA cluster ensures that most unplanned
outages are limited to under a minute of impact. Without such a
cluster, outages may last an unpredictable period of time.
Introducing DRBDLinks
In a typical system running DRBD, there will be many directories
and files that reside on the shared data partition. There are typically
two ways to handle this data in an HA cluster:
1. Change the startup/shutdown scripts such that applications
directly access their configuration and data on the shared partition.
In this case, the normal configuration and data files continue to
exist in their regular system locations, but those files do not
control the operation of the services. This can lead to confusion
as operations staff cannot maintain the systems in the same way
that they would other systems.
2. Link the normal system files and directories into the shared
partition. This means that configuration files and data reside in
their familiar locations on the primary system. However, links must
be set up when the service starts on the primary and returned to
normal when the service is not operating. Leaving the links set
up when the application is not running will often cause package
updates to fail and other problems.
These links can be maintained in the heartbeat startup and shutdown
scripts using the standard ln, mv, and rm commands.
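For example, a hand-maintained start script fragment might do the
following for each linked path (with the stop script reversing it); the
".orig" suffix here is an arbitrary choice for this sketch:

# in the "start" script:
mv /etc/httpd /etc/httpd.orig        # set the local copy aside
ln -s /shared/etc/httpd /etc/httpd   # point the system path at the shared copy

# in the "stop" script:
rm /etc/httpd                        # removes only the symbolic link
mv /etc/httpd.orig /etc/httpd        # restore the local copy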
DRBDLinks automates this task using a simple configuration file.
Included is an "init.d" script to make sure that on boot the links
are set in their normal state. On reboot after a hard failover,
this will correct the links on the system. DRBDLinks would also
be listed in the heartbeat "haresources" file, which causes heartbeat
to set up the links as part of the application startup and clear
them on shutdown.
DRBDLinks Configuration
For example, let's make a Web server backed by a PostgreSQL database
highly available using heartbeat and DRBD. The DRBD shared partition
will be mounted on "/shared". The "/shared" partition will mimic
the layout of the root partition, meaning that the following directories
contain the clustered files: "/shared/etc", "/shared/var/www", "/shared/var/lib/pgsql",
and so on.
The first step in the DRBDLinks configuration is to install the
"drbdlinks" package. The DRBDLinks Web page (see the References
section) includes links to the software source as well as RPM and
Source RPM packages. For a Fedora Core 3 or similar system, DRBDLinks
can be installed by running the following command as root:
BASE=ftp://ftp.tummy.com/pub/tummy/drbdlinks/
rpm -ivh "$BASE"/drbdlinks-1.01-1.noarch.rpm
The "/etc/drbdlinks.conf" file now needs to be configured. The sample
"drbdlinks.conf" included with the standard DRBDLinks packages lists
example settings and documentation on how to use the configuration
file. We'll use the following "drbdlinks.conf" for our example system:
mountpoint('/shared')
link('/etc/httpd')
link('/var/log/httpd')
link('/var/www')
link('/var/lib/pgsql')
Next, the above directories need to be set up in the shared partition.
If DRBD is being used, it needs to be started and its partition mounted
under "/shared" (DRBDLinks works equally well with an external RAID
array). Please refer to the DRBD documentation or your distribution's
documentation if more information is needed on mounting these partitions.
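Assuming a DRBD resource named "drbd0" backed by "/dev/drbd0" (matching
the "haresources" example below), the rough sequence on the node that
will become primary is something like the following; exact commands
vary between DRBD versions:

service drbd start        # load the DRBD module and bring up the resource
drbdadm primary drbd0     # make this node primary for the resource
mkfs -t ext3 /dev/drbd0   # first time only: create the file system
mount /dev/drbd0 /shared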
To create and populate these directories, we'll use the data in
the local system to prime the directories with the following commands.
Note that these commands shut down the PostgreSQL database to ensure
that a good copy of the data is made:
service postgresql stop
mkdir -p /shared/etc/httpd
mkdir -p /shared/var/log/httpd
mkdir -p /shared/var/www
mkdir -p /shared/var/lib/pgsql
cd /etc/httpd
cp -av . /shared/etc/httpd/
cd /var/log/httpd
cp -av . /shared/var/log/httpd
cd /var/www
cp -av . /shared/var/www
cd /var/lib/pgsql
cp -av . /shared/var/lib/pgsql
Finally, the heartbeat resource can be configured to start and stop
DRBDLinks by modifying the "/etc/ha.d/haresources" file. In this example,
we have the node named "ha1" acting as the default server, providing
an application IP address of "10.0.0.1" and running startup scripts
for Apache and PostgreSQL. The "/etc/ha.d/haresources" file would
contain:
ha1 IPaddr::10.0.0.1 drbddisk::drbd0 drbdlinks postgresql httpd
This will cause DRBDLinks to be run automatically as part of the heartbeat
start and stop process. You can manually test the results by running
"/usr/sbin/drbdlinks start" and "/usr/sbin/drbdlinks stop". Note that
DRBDLinks verifies the current status of the system, so it is safe
to run "start" and "stop" multiple times. It will not result in links
being made or reset multiple times.
For example, after running the above "start" command, the results
would look like:
[root@ha1 root]# ls -ld /etc/httpd*
lrwxrwxrwx 1 root root 17 Dec 29 13:14 /etc/httpd -> /shared/etc/httpd
drwxr-xr-x 4 root root 4096 Nov 15 16:24 /etc/httpd.drbdlinks
DRBDLinks moves the system "httpd" directory to "httpd.drbdlinks" and
makes a symbolic link to the version in "/shared". After running the "stop"
command, the directory would be returned to normal:
[root@ha1 root]# ls -ld /etc/httpd*
drwxr-xr-x 4 root root 4096 Nov 15 16:24 /etc/httpd
That's all there is to it. For more information on the options that
can be used, run /usr/sbin/drbdlinks --help:
usage: drbdlinks (start|stop|auto)
options:
-h, --help show this help message and exit
-cCONFIGFILE, --config-file=CONFIGFILE
Location of the configuration file.
-sSUFFIX, --suffix=SUFFIX
Name to append to the local file-system name when
the link is in place.
-v, --verbose Increase verbosity level by 1 for every "-v".
The "auto" option detects whether the shared directory is mounted
and runs "start" if it is and "stop" if it is not.
Conclusion
Linux high-availability clusters can provide significant advantages
at reasonable cost. There are significant operational advantages
to keeping configuration and data in the normal system locations.
DRBDLinks can be used to easily automate creating and cleaning up
these links into the shared partition. Despite its name, DRBDLinks
can be used equally well with DRBD or an external RAID array.
References
DRBDLinks homepage and download -- http://www.tummy.com/Community/software/drbdlinks/
DRBD homepage and download -- http://www.drbd.org/
Linux-HA project, including links to the Heartbeat package --
http://www.linux-ha.org/
Sean Reifschneider is co-founder and a member of the technical
staff at tummy.com, ltd. With tummy.com, he helps provide Linux-based
solutions to clients and has participated in the Linux and open
source communities since 1995. For more of his writing on technical
topics, see: http://www.tummy.com/journals/users/jafo. He
can be reached at: jafo@tummy.com.