Linux High Availability Clusters with Heartbeat, DRBD, and DRBDLinks

Sean Reifschneider

Linux clusters using Heartbeat and DRBD allow High Availability (HA) clusters to be created very inexpensively. In the past, HA clusters typically required a standalone RAID array (preferably Fibre Channel) in addition to the pair of servers. Now, for a fraction of the cost of a standalone RAID array and using entirely free software, an HA cluster can be built with Heartbeat and DRBD.

While HA clusters can be attractive because of their increased resiliency, they definitely increase management overhead. Not only do you have to update software on two machines, but there's also added complexity associated with each system. HA systems must also be tested, preferably regularly. As Alan Robertson, leader of the Heartbeat project, says, "An HA system which isn't tested will surprise you someday."

One of the tools we've created in our work setting up and verifying clusters for clients, DRBDLinks, helps address the management complexities of HA clusters that use DRBD and Heartbeat. In this article, I will first provide an overview of HA clustering using DRBD and Heartbeat, and then demonstrate the use of DRBDLinks.

HA Clustering Overview

High Availability (HA) clusters involve a pair of machines acting as a single, more resilient machine. One machine is the primary and the other stands by, monitoring the primary. In the event of any sort of failure of the primary machine's hardware or software, the secondary machine will take over the application. In most cases, this is more like a fast reboot than a seamless, uninterrupted operation.

Note that this is different from "Compute Clusters", in which many machines are used to work on parts of a very big problem. HA clusters will not provide a performance increase over running on a single machine.

The primary benefit of an HA cluster is that downtime exposure is limited. If a power supply fails on one machine at 3 a.m., the services will be back up and running on the spare within seconds or minutes without any human interaction. You can also do hardware and software maintenance, even maintenance involving extended downtime, while suffering only a very limited unavailability. Depending on your data and applications, this downtime may be a few seconds or a few minutes.

Obviously, both machines must have access to the data required to run the applications, as well as network connectivity and other related resources. One common way of implementing shared data is to set up an external RAID array to which both machines can be connected. The machines then arbitrate their access to that resource to ensure that only one is mounting the file system at any given time. Another, less expensive option, is to use DRBD, which I will describe in detail later.

The applications in the cluster are brought up and down using "start" and "stop" scripts. A typical start script would need to:

  • Run "fsck" on the file system. In the case of a hardware failure on the primary node, the file system would be left in an unclean state. Obviously, a journaling file system is preferable so that fsck takes seconds instead of (possibly tens of) minutes.
  • Mount the shared file system.
  • Set up any crontabs that are related to the application. Cron jobs that require access to the shared data can't just be dropped into the standard crontab. They must be set up to run only on the active node.
  • Start any applications that provide access to the shared data. For example, Web servers, databases, email (POP, IMAP, SMTP) servers, etc.

The "stop" script is usually just the inverse of the start script.

Heartbeat

Heartbeat is a program that runs on both machines and elects one node as primary. Once the cluster is up, the secondary machine monitors the primary and, in the event of a failure, promotes itself to primary. Heartbeat runs scripts on the primary node to bring the applications up and shut them down. In that way, it's similar to a multi-machine "init" process.

The machines in the cluster communicate with each other by sending "heartbeat packets" about twice a second. Typically, you would configure the heartbeats to run over multiple different paths, for example, via serial and network links. Missing a few seconds of these heartbeats could trigger a failover, so you need to make sure that rebooting a switch or unplugging a network cable won't cause one.
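
The heartbeat paths and timing are configured in "/etc/ha.d/ha.cf". The following is a minimal sketch, assuming nodes named "ha1" and "ha2", a null-modem cable on ttyS0, and a dedicated crossover link on eth1; your device names and timeouts will differ:

# /etc/ha.d/ha.cf (sketch)
keepalive 1        # interval between heartbeat packets, in seconds
warntime 10        # log a warning after this much silence
deadtime 30        # declare the peer dead after this much silence
initdead 60        # extra grace period at boot time
serial /dev/ttyS0  # heartbeat over the serial cable
baud 19200
bcast eth1         # heartbeat over the dedicated crossover link
auto_failback on
node ha1
node ha2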

Heartbeat provides standard "start" and "stop" scripts for doing many common tasks such as bringing up and down IP addresses or mounting partitions. It's worth noting that the IP address scripts will send out gratuitous ARP packets to alert other devices on the network of the topology change. System start/stop scripts, typically found in "/etc/init.d", can also be directly used with Heartbeat. For more complex tasks, custom scripts can be called to start or stop parts of the application.
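
Resources are assigned to a preferred node in the "/etc/ha.d/haresources" file (covered in more detail later). For instance, a line using only the bundled scripts might look like the following, where the address, device, and mount point are placeholders:

ha1 IPaddr::10.0.0.1/24 Filesystem::/dev/sdb1::/shared::ext3 httpd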

Another part of Heartbeat is called STONITH, short for Shoot The Other Node In The Head. In the case of a primary node failure, the secondary node uses STONITH to ensure that the old primary is definitely down before bringing up the services on the secondary node. STONITH uses serial or network-accessible power switches to power-cycle the other node, thereby ensuring that the cluster isn't running as a "split brain".
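
STONITH devices are declared in "ha.cf", and "stonith -L" lists the plugins compiled into your Heartbeat build. The line below is only a sketch, assuming a Baytech network power switch at a placeholder address; the plugin name and parameters depend entirely on your hardware:

# /etc/ha.d/ha.cf (sketch) -- power-cycle the peer via a network power switch
# stonith_host <from-node> <plugin> <plugin parameters>
stonith_host * baytech 10.0.0.5 mylogin mypassword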

The so-called "split brain" operation can happen if there is a problem in configuration or the Heartbeat communication paths. A split brain occurs when both machines in the cluster believe they are primary and try to run the application. This often leads to undesired results, such as multiple machines having the same IP address or file systems on shared devices becoming corrupted.

DRBD

As mentioned in the introduction, one mechanism for sharing data between two machines is to use an external RAID array. The primary drawback to this is cost, with typical array configurations costing no less than $2,000.

DRBD is a "Distributed Replicated Block Device" that allows similar results to be achieved on local discs using a network connection for replication. DRBD can be thought of as a RAID-1 (mirrored drives) system that mirrors a local hard drive with a drive on another computer. In fact, early clusters were set up using the Linux "md" RAID-1 and "nbd" network block device drivers to literally mirror a local drive with one on a remote system. That solution didn't prove to be very robust, however.

DRBD includes mechanisms for tracking which system has the most recent data, "change logs" to allow a fast partial re-sync, and startup scripts that reduce the likelihood that a system will come up in "split brain" operation.

We typically set up a dedicated network using a direct cross-over network connection between the machines. With Gigabit Ethernet, the network has more bandwidth than a single typical IDE drive does. We've synced 70 GB of data in about 20 minutes, or around 55 MB/sec. However, in most cases, you don't need anywhere near this much bandwidth. A 100-Mbps network, which can handle up to around 10 MB/sec, will work for most applications.
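
The replicated device itself is described in "/etc/drbd.conf" on both nodes. The following is a minimal sketch, assuming hosts named "ha1" and "ha2", a dedicated 10.0.1.x crossover network, and a spare partition "/dev/sda3" on each machine; all of these names are placeholders, and the available options vary by DRBD version:

# /etc/drbd.conf (sketch)
resource drbd0 {
  protocol C;               # fully synchronous replication
  on ha1 {
    device    /dev/drbd0;
    disk      /dev/sda3;    # local partition backing the mirror
    address   10.0.1.1:7788;
    meta-disk internal;
  }
  on ha2 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   10.0.1.2:7788;
    meta-disk internal;
  }
}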

Sample Applications

There are a number of applications where this sort of high-availability cluster is typically used:

  • NFS servers, especially with "email toasters" -- The NFS protocol is truly stateless. A pair of NFS servers can be set up and, in the event of a failure of the primary, the secondary comes up in its place. In our experience, these failovers result in data access on the clients hanging for a few seconds and then continuing as if nothing had happened. One particular use of this is having a number of inexpensive machines running POP, IMAP, and SMTP with email data residing on the NFS server.
  • Load balancers -- A cluster of email servers, as mentioned above, would often have a load balancer in front of it that makes the cluster of machines look like a single larger system. If this load balancer goes down, all access to the systems behind it is interrupted. For this reason, load balancers are often set up as a pair of machines in an HA cluster.
  • Web servers -- For a business that relies on its Web presence, an HA cluster makes a lot of sense. While HA clusters are no replacement for good monitoring, maintenance, and backup practices, they can help ensure that your Web site remains available even in the event of software and hardware problems.

Deciding to Implement Linux-HA

It's important to understand what is involved with setting up an HA cluster before you decide to use one. There is much more involved in it than simply installing and configuring a couple more packages. In fact, it's quite a lot more complicated than maintaining two standalone computers.

At the outset, you should realize that an HA cluster may increase the downtime of your applications. Really. The added complexity of an HA cluster can make it easier for operational mistakes to cause outages. As Tandem Computers found in its studies in the 1980s and 1990s, once its software and hardware became more resilient, operational errors dominated the causes of outages.

One large cause of downtime in HA clusters is regular testing. It's important to verify that changes you have made to the system do not impact its ability to failover. It's equally important for your operations staff to remain familiar with the commands and processes used in the HA cluster and its failover. Expect to run failover testing at least once a quarter and possibly monthly or more if you are making changes to the systems. Never trust your business to an untested HA cluster.
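
During a test window, a simple way to exercise a failover is to hand the resources to the peer with the "hb_standby" helper shipped with Heartbeat, or simply to stop Heartbeat on the active node; the exact path below varies by distribution:

# On the active node, hand resources over to the standby:
/usr/lib/heartbeat/hb_standby

# Or, more drastically, stop Heartbeat entirely:
service heartbeat stop

# Watch /var/log/messages on both nodes to confirm the takeover.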

Why bother with an HA cluster if it may increase your downtime? The downtime that a cluster causes is scheduled downtime, and therefore tends to be much less disruptive than unplanned downtime. For scheduled outages, you would usually have the resources on hand and ready to resolve any problems in the event that the failover takes more than the usual few seconds or minutes.

In contrast, imagine that an unscheduled outage occurs -- a RAM module fails at 3 a.m. on a Monday. It could take anywhere from 15 minutes to several hours before the outage is noticed, and minutes or hours more to get staff in place to diagnose and repair it (assuming replacement parts are readily available). An HA cluster could have had the application back up within a minute.

An HA cluster is, quite simply, risk management. It may lead to more scheduled downtime, but it can save hours or even days of downtime in the event of a failure. An HA cluster ensures that most unplanned outages are limited to under a minute of impact. Without such a cluster, outages may last an unpredictable period of time.

Introducing DRBDLinks

In a typical system running DRBD, there will be many directories and files that reside on the shared data partition. There are typically two ways to handle this data in an HA cluster:

1. Change the startup/shutdown scripts such that applications directly access their configuration and data on the shared partition. In this case, the normal configuration and data files continue to exist in their regular system locations, but those files do not control the operation of the services. This can lead to confusion as operations staff cannot maintain the systems in the same way that they would other systems.

2. Link the normal system files and directories into the shared partition. This means that configuration files and data reside in their familiar locations on the primary system. However, links must be set up when the service starts on the primary and returned to normal when the service is not operating. Leaving the links set up when the application is not running will often cause package updates to fail, among other problems.

These links can be maintained in the heartbeat startup and shutdown scripts using the standard ln, mv, and rm commands. DRBDLinks automates this task using a simple configuration file.
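
For comparison, the manual approach in a start script might look something like the following sketch (with the inverse run at stop time); this is exactly the bookkeeping that DRBDLinks takes over:

# Manual link management (sketch) -- the ".orig" suffix is arbitrary.
mv /etc/httpd /etc/httpd.orig
ln -s /shared/etc/httpd /etc/httpd
mv /var/lib/pgsql /var/lib/pgsql.orig
ln -s /shared/var/lib/pgsql /var/lib/pgsql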

Included is an "init.d" script to make sure that the links are returned to their normal state on boot. After a hard failover and reboot, this will correct the links on the system. DRBDLinks is also listed in the heartbeat "haresources" file, which causes heartbeat to set up the links as part of the application startup and clear them on shutdown.
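
On a Red Hat-style system, the init script can be enabled with chkconfig; the service name here assumes the package installs it as "drbdlinks":

chkconfig --add drbdlinks
chkconfig drbdlinks on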

DRBDLinks Configuration

As an example, let's take a Web server backed by a PostgreSQL database that we want to make highly available using heartbeat and DRBD. The DRBD shared partition will be mounted on "/shared". The "/shared" partition will mimic the layout of the root partition, meaning that the following directories contain the clustered files: "/shared/etc", "/shared/var/www", "/shared/var/lib/pgsql", and so on.

The first step in the DRBDLinks configuration is to install the "drbdlinks" package. The DRBDLinks Web page (see the References section) includes links to the software source as well as RPM and Source RPM packages. For a Fedora Core 3 or similar system, DRBDLinks can be installed by running the following command as root:

BASE=ftp://ftp.tummy.com/pub/tummy/drbdlinks/
rpm -ivh "$BASE"/drbdlinks-1.01-1.noarch.rpm
The "/etc/drbdlinks.conf" file now needs to be configured. The sample "drbdlinks.conf" included with the standard DRBDLinks packages lists example settings and documentation on how to use the configuration file. We'll use the following "drbdlinks.conf" for our example system:

mountpoint('/shared')
link('/etc/httpd')
link('/var/log/httpd')
link('/var/www')
link('/var/lib/pgsql')
Next, the above directories need to be set up in the shared partition. If DRBD is used, it needs to be started and its partition mounted under "/shared"; DRBDLinks works equally well with an external RAID array. Please refer to the DRBD or Linux documentation if you need more information on mounting these partitions.
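
With DRBD, that typically amounts to something like the following on the node that will initially be primary. The resource name matches the "drbd0" used later in "haresources"; the file system is created only once, on first setup, and forcing the node primary for the very first sync may require an extra option depending on your DRBD version:

service drbd start
drbdadm primary drbd0     # may need a force option on the very first sync
mkfs -t ext3 /dev/drbd0   # first-time setup only -- destroys existing data
mount /dev/drbd0 /shared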

To create and populate these directories, we'll use the data in the local system to prime the directories with the following commands. Note that these commands shut down the PostgreSQL database to ensure that a good copy of the data is made:

service postgresql stop
mkdir -p /shared/etc/httpd
mkdir -p /shared/var/log/httpd
mkdir -p /shared/var/www
mkdir -p /shared/var/lib/pgsql
cd /etc/httpd
cp -av . /shared/etc/httpd/
cd /var/log/httpd
cp -av . /shared/var/log/httpd
cd /var/www
cp -av . /shared/var/www
cd /var/lib/pgsql
cp -av . /shared/var/lib/pgsql
Finally, the heartbeat resource can be configured to start and stop DRBDLinks by modifying the "/etc/ha.d/haresources" file. In this example, we have the node named "ha1" acting as the default server, providing an application IP address of "10.0.0.1" and running startup scripts for Apache and PostgreSQL. The "/etc/ha.d/haresources" file would contain:

ha1 IPaddr::10.0.0.1 drbddisk::drbd0 drbdlinks postgresql httpd
This will cause DRBDLinks to be run automatically as part of the heartbeat start and stop process. You can manually test the results by running "/usr/sbin/drbdlinks start" and "/usr/sbin/drbdlinks stop". Note that DRBDLinks verifies the current status of the system, so it is safe to run "start" and "stop" multiple times. It will not result in links being made or reset multiple times.

For example, after running the above "start" command, the results would look like:

[root@ha1 root]# ls -ld /etc/httpd*
lrwxrwxrwx  1 root root   17 Dec 29 13:14 /etc/httpd -> /shared/etc/httpd
drwxr-xr-x  4 root root 4096 Nov 15 16:24 /etc/httpd.drbdlinks
DRBDLinks renames the system "httpd" directory to "httpd.drbdlinks" and makes a link to the version in "/shared". After running the "stop" command, the directory would be returned to normal:

[root@ha1 root]# ls -ld /etc/httpd*
drwxr-xr-x  4 root root 4096 Nov 15 16:24 /etc/httpd
That's all there is to it. For more information on the options that can be used, run /usr/sbin/drbdlinks --help:

usage: drbdlinks (start|stop|auto)

options:
  -h, --help       show this help message and exit
  -cCONFIGFILE, --config-file=CONFIGFILE
                   Location of the configuration file.
  -sSUFFIX, --suffix=SUFFIX
                   Name to append to the local file-system name when
                   the link is in place.
  -v, --verbose    Increase verbosity level by 1 for every "-v".
The "auto" option detects whether the shared directory is mounted and runs "start" if it is and "stop" if it is not.

Conclusion

Linux high-availability clusters can provide significant advantages at reasonable cost. There are significant operational advantages to keeping configuration and data in the normal system locations. DRBDLinks can be used to easily automate creating and cleaning up these links into the shared partition. Despite its name, DRBDLinks can be used equally well with DRBD or an external RAID array.

References

DRBDLinks homepage and download -- http://www.tummy.com/Community/software/drbdlinks/

DRBD homepage and download -- http://www.drbd.org/

Linux-HA project, including links to the Heartbeat package -- http://www.linux-ha.org/

Sean Reifschneider is co-founder and a member of the technical staff at tummy.com, ltd. With tummy.com, he helps provide Linux-based solutions to clients and has participated in the Linux and open source communities since 1995. For more of his writing on technical topics, see: http://www.tummy.com/journals/users/jafo. He can be reached at: jafo@tummy.com.