aug2003.tar

Data Replication in High Availability Clusters

Nelson Yount

The ultimate goal of high availability (HA) clustering is to improve the level of availability of applications and other services to their end-users. HA clusters generally consist of two or more computer systems in close physical proximity. The HA clustering software is designed to monitor the status of all of the systems in the cluster, and upon the failure of any of those systems, to restart on one or more of the remaining systems any applications that were running on the failed system. The clustering software will typically also move the network address associated with such an application to the backup system, allowing clients to easily continue to access the application.

Cluster Storage Models

For this application migration to work, it is necessary that the application data be accessible both to the computer system on which the application was running originally (the primary system) and to the system to which the application is moved (the backup system). This is usually accomplished with some form of shared storage, typically either an external storage array (to which both systems are connected by a common SCSI bus or a Fibre Channel network) or a network attached storage (NAS) device. By placing the application data on a shared external storage device, an application has access to its data regardless of whether it is running on the primary or backup system.

Cluster Configurations

The simplest HA cluster configuration consists of two systems, with one system actively running one or more applications, and the second system acting purely as a backup for the first. This configuration is known as a 2-node active/passive cluster. A more common and more useful configuration is the 2-node active/active cluster (see Figure 1). Again, the cluster consists of two systems, but in this case both systems are actively running applications and each is serving as the backup for the other. If either system experiences a failure, the other system in the cluster takes over the applications from the failed system, while continuing to run its own applications.

There are a number of ways to extend a cluster beyond two systems. One of the most common configurations is known as the N+1 cluster. Such a cluster consists of N active systems, each running one or more applications, and a single passive system that serves as the backup for all of the N active systems. This configuration can be thought of as N different 2-node active/passive clusters that all have the same passive backup system. The storage for such a cluster can be shared between all N+1 systems, or each of the N systems can have its own external storage that can also be accessed by the backup system.

All of the cluster configurations discussed thus far involve a simple pairing of systems. Some HA clustering products cannot go beyond this, and in many cases, simple pairing may be sufficient to meet the HA needs of a given computing environment. But many clustering products today have the added flexibility of going beyond paired failovers. Such products are capable of supporting clusters of more than two systems in which all of the systems have access to the same external storage device. They typically provide features that offer much greater flexibility in the way an application can be migrated in the event of a failure. Products that support multi-directional failover allow the cluster to be configured such that when a given system fails, some of the applications from the failed system can be migrated to one system in the cluster, some to a second, some to a third, and so on. This allows the additional load from the failed system to be better distributed across the remaining systems in the cluster.

More advanced clustering products may also support a feature known as cascading failover. This means that a given application can migrate across more than two systems, allowing it to survive multiple system failures. After an initial system failure in which a given application is migrated to a backup system, if the backup system then also fails, the application can be migrated to yet another backup system. These application failovers can cascade across all the systems in the cluster if necessary, in an attempt to keep the application running and available.
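As a rough illustration of multi-directional and cascading failover, the following Python sketch gives each application its own ordered list of candidate systems. The application names, node names, and the FAILOVER_ORDER table are hypothetical and do not reflect any particular clustering product; real products express this policy in their own configuration formats.

```python
from typing import Dict, List, Optional, Set

# Hypothetical failover policy: each application lists its candidate
# systems in priority order. All names here are illustrative only.
FAILOVER_ORDER: Dict[str, List[str]] = {
    "database": ["node1", "node2", "node3"],
    "webserver": ["node1", "node3", "node2"],
    "mail": ["node2", "node3", "node1"],
}


def pick_host(app: str, failed: Set[str]) -> Optional[str]:
    """Return the first system in the application's list that is still up."""
    for node in FAILOVER_ORDER[app]:
        if node not in failed:
            return node
    return None  # no surviving system can run the application


def placement(failed: Set[str]) -> Dict[str, Optional[str]]:
    """Compute where every application runs for a given set of failed systems."""
    return {app: pick_host(app, failed) for app in FAILOVER_ORDER}
```

With this table, a failure of node1 sends the database to node2 and the web server to node3 (multi-directional failover). If node2 then fails as well, the database moves on to node3 (cascading failover).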

The Data Replication Alternative

All of these HA clustering options depend on the ability of the application to access its data from each of the systems on which the application needs to run. As noted previously, this is almost always accomplished via some form of shared storage. However, there is an additional option for providing application data access on multiple systems, especially in small 2-node clusters. Data replication technology can be used to maintain separate identical local copies of the application data on the two systems.

What Is Data Replication?

Data replication is simply a method of copying data from a source system to a target system via a network, with the intent of maintaining an identical copy of some portion of the source system's data on the target system. This mechanism usually involves the use of an additional driver layer on the source system to intercept writes and propagate the modified data to a corresponding driver or daemon on the target system.

There are several types of data replication available. The most common methods operate at the block, filesystem, or file level. A block-level replication mechanism operates by intercepting writes at the disk block level and replicating those changes to the target system. This mechanism works transparently to the filesystem or application that is using the underlying disk volume. A filesystem-level replication scheme intercepts writes just above the filesystem layer and is therefore specific to the underlying filesystem. File-level replication replicates changes to a specific file or group of files on the source system.

Besides these forms of replication, some database engines include built-in replication capabilities for maintaining an identical copy of a database on a second system. Some high-end external storage systems have the ability to perform their own replication to a second storage system, avoiding any involvement by the systems that are using the storage.

Data replication also has three primary modes of operation: synchronous, asynchronous, and periodic. In the synchronous mode, an application's write request does not complete until the data has been successfully written on both the source and target systems. In the asynchronous mode, the application's write request results in having the data written on the source system and queued for transmission to the target system. The queued data update is then transmitted to the remote target system as system resources and network bandwidth allow. The periodic mode of replication simply transmits data changes on the source system at predetermined periodic intervals in a batch-like manner. Data replication products may support any or all of these modes of operation.
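The difference between the three modes can be seen in a toy model. The Python sketch below is not how any real replication driver is built: the Replicator class, its in-memory dictionaries standing in for disk volumes, and the drain and run_interval methods are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class Replicator:
    """Toy model of a source volume replicated to a target volume."""

    def __init__(self, mode: str) -> None:
        if mode not in ("synchronous", "asynchronous", "periodic"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.source: Dict[int, bytes] = {}   # block number -> data
        self.target: Dict[int, bytes] = {}
        self.queue: Deque[Tuple[int, bytes]] = deque()  # asynchronous backlog
        self.changed: Dict[int, bytes] = {}             # periodic batch

    def write(self, block: int, data: bytes) -> None:
        self.source[block] = data
        if self.mode == "synchronous":
            # The write completes only after the target has the data too.
            self.target[block] = data
        elif self.mode == "asynchronous":
            # The write completes now; the update waits in an ordered queue.
            self.queue.append((block, data))
        else:
            # Periodic: remember only the latest contents of each block.
            self.changed[block] = data

    def drain(self) -> None:
        """Asynchronous mode: send queued updates in order."""
        while self.queue:
            block, data = self.queue.popleft()
            self.target[block] = data

    def run_interval(self) -> None:
        """Periodic mode: send the accumulated changes as one batch."""
        self.target.update(self.changed)
        self.changed.clear()

    def unreplicated(self) -> int:
        """Number of updates that exist only on the source system."""
        return len(self.queue) + len(self.changed)
```

In this model, a nonzero value from unreplicated() represents updates that exist only on the source system until drain() or run_interval() is called.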

Data Replication and HA Clusters

To better understand the use of data replication in a HA cluster, consider the case of a 2-node active/passive cluster running a single application (see Figure 2). When the application is active on the primary system, all updates to the application data are automatically replicated to the backup system by the data replication mechanism. When a failure occurs and the application is moved to the backup system, it continues its operations using the mirrored data residing on the backup system. When the primary system is returned to service, the direction of the data replication is reversed, such that all data updates on the backup system are replicated to the primary system, after an initial resynchronization process to bring the primary system up to date with any data changes that may have occurred while it was unavailable.

Data replication can generally be used with each of the other cluster configurations discussed previously, although each of those configuration types requires certain capabilities that not all data replication products possess. Data replication can be used in 2-node active/active clusters, but the replication mechanism must be capable of supporting simultaneous replication in both directions between the two systems. The N+1 cluster configuration requires that the data replication mechanism be capable of supporting multiple (N) source systems replicating data to a single target system. Multi-directional failover can be configured if the replication product can support a single source system replicating multiple data volumes, each with a potentially different target system. Cascading failover can be supported only if the data replication mechanism has the ability to replicate a single data volume on the source system to multiple target systems.
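One way to reason about these requirements is to write down the replication links a cluster needs and derive the capabilities from them. The Python sketch below does this for links expressed as (source system, volume, target system) tuples. The capability labels and the function itself are invented for this illustration and are not terms used by any replication product.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Link = Tuple[str, str, str]  # (source system, volume, target system)


def required_capabilities(links: Iterable[Link]) -> Set[str]:
    """Derive what a replication product must support for these links."""
    links = list(links)
    pairs = {(s, t) for s, _, t in links}
    sources_per_target: Dict[str, Set[str]] = defaultdict(set)
    targets_per_volume: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for s, v, t in links:
        sources_per_target[t].add(s)
        targets_per_volume[(s, v)].add(t)

    # Group the distinct target sets used by the volumes of each source.
    target_sets: Dict[str, Set[FrozenSet[str]]] = defaultdict(set)
    for (s, _), targets in targets_per_volume.items():
        target_sets[s].add(frozenset(targets))

    needs: Set[str] = set()
    if any((t, s) in pairs for s, t in pairs):
        needs.add("bidirectional")        # active/active
    if any(len(x) > 1 for x in sources_per_target.values()):
        needs.add("many-to-one")          # N+1
    if any(len(x) > 1 for x in target_sets.values()):
        needs.add("per-volume targets")   # multi-directional failover
    if any(len(x) > 1 for x in targets_per_volume.values()):
        needs.add("one-volume-to-many")   # cascading failover
    return needs
```

A plain active/passive pair, with a single link from one system to the other, yields an empty set: it needs none of the extra capabilities.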

Data Replication Drawbacks

The use of data replication in HA clusters is not without challenges. A number of issues must be considered before deciding to use data replication rather than some form of shared storage. Perhaps the first issue to consider is performance. Because of the additional overhead required to transmit data updates to a remote target system for every write request by an application, the synchronous mode of data replication can have severe performance impacts on write-intensive applications. Both the asynchronous and periodic modes of replication can alleviate much of this performance concern, but they both present an increased risk of data loss. If the primary system in a cluster fails at a point in time when there is application data that has not yet been replicated to the backup system, that data will be lost when the application is migrated to the backup system.

Data replication also presents a requirement for data resynchronization that is not present with the shared storage form of HA clustering. After a primary system has failed, a failover has been performed, and the primary system has been restored to service, a data resynchronization process must occur to bring the data on the primary system up to date with the backup system. This process must include any data changes that have occurred after the application began running on the backup. Until this resynchronization process has been completed, to avoid data loss, the application cannot be returned to the primary system (either manually or automatically) in response to a failure of the backup system. It is usually left to the HA clustering software to enforce this restriction. During this time period, the application has no HA protection at all.
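The restriction described above amounts to a simple guard in the clustering software. The following Python sketch models it for a 2-node cluster. The ReplicatedPair class and its method names are hypothetical, and a real product would track considerably more state.

```python
class ReplicatedPair:
    """Toy state model for a 2-node cluster built on data replication."""

    def __init__(self) -> None:
        self.active = "primary"      # system currently running the application
        self.mirror_in_sync = True   # does the standby hold a current copy?

    def other(self) -> str:
        return "backup" if self.active == "primary" else "primary"

    def active_system_failed(self) -> bool:
        """Attempt a failover; return True if the application was moved."""
        if not self.mirror_in_sync:
            # The standby copy is stale, so moving the application
            # would discard updates. Refuse the failover.
            return False
        self.active = self.other()
        # The failed system will need resynchronization when it returns.
        self.mirror_in_sync = False
        return True

    def resync_complete(self) -> None:
        """Called once the restored system has caught up with the active one."""
        self.mirror_in_sync = True
```

Between a successful failover and the call to resync_complete(), any further failure is refused rather than risking data loss. This is the window in which the application has no HA protection.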

The most straightforward method of performing data resynchronization is to retransmit the entire volume of data to the out-of-date target system. For large data volumes, this can be quite time consuming, requiring hours or even days. Some mechanism is needed that will allow resynchronization to be performed more quickly and efficiently. This requirement may be met by either transaction logging or intent logging. Transaction logging is where every individual data update is recorded in a log that can be replayed in the case of a failure. Intent logging uses a bitmap or other structure to record which blocks of data have been modified, such that resynchronization can be accomplished simply by replicating only those blocks that are known to have changed.
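A minimal sketch of bitmap-based intent logging follows, in Python. It assumes fixed-size blocks held in in-memory lists. The IntentLog class and the resync function are illustrative names, not the interface of any real replication product.

```python
from typing import List


class IntentLog:
    """Bitmap recording which blocks changed while the target was unreachable."""

    def __init__(self, num_blocks: int) -> None:
        self.num_blocks = num_blocks
        self.bits = bytearray((num_blocks + 7) // 8)  # one bit per block

    def mark(self, block: int) -> None:
        """Record that a block has been modified."""
        self.bits[block // 8] |= 1 << (block % 8)

    def dirty_blocks(self) -> List[int]:
        """List the block numbers whose bits are set."""
        return [
            b for b in range(self.num_blocks)
            if self.bits[b // 8] & (1 << (b % 8))
        ]

    def clear(self) -> None:
        self.bits = bytearray(len(self.bits))


def resync(source: List[bytes], target: List[bytes], log: IntentLog) -> int:
    """Copy only the blocks marked dirty; return how many were sent."""
    dirty = log.dirty_blocks()
    for block in dirty:
        target[block] = source[block]
    log.clear()
    return len(dirty)
```

Because the bitmap records only that a block changed, not each individual update, a block written many times is still transmitted once. The trade-off is that the order of the original writes is not preserved, whereas a transaction log can replay updates in sequence.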

HA clusters that are built using data replication are also more susceptible to a clustering issue called the "split-brain" problem. This problem arises when a failure occurs in the communication infrastructure between the systems in the cluster. If two clustered systems lose their ability to communicate with one another, they are both led to believe that the other system has somehow failed and will attempt to perform a failover operation. In a shared storage cluster, the results can be disastrous, with the application running on both systems and both performing writes to the shared data. This will almost certainly result in serious data corruption.

Fortunately, shared storage can also be used to help avoid this problem. HA clustering products differ in the means they use, which include SCSI reservations, communication paths through the shared storage device itself, and disk-based quorum mechanisms. However, in clusters based purely on data replication, none of these techniques is available. Thus, it is crucial to heed the advice of nearly all clustering software vendors to establish multiple independent paths of communication between the systems in your cluster.
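The value of multiple independent communication paths can be sketched as a simple rule: treat the peer as failed only when every path has gone silent. The Python function below illustrates that rule. The path names, timestamps, and timeout are assumptions for the example, and the rule reduces the likelihood of a split-brain condition rather than eliminating it.

```python
from typing import Dict


def peer_presumed_failed(
    last_heard: Dict[str, float], now: float, timeout: float
) -> bool:
    """Return True only if every heartbeat path has been silent too long.

    last_heard maps a path name (for example "lan" or "serial") to the
    time, in seconds, at which a heartbeat last arrived on that path.
    """
    if not last_heard:
        raise ValueError("at least one heartbeat path is required")
    return all(now - heard > timeout for heard in last_heard.values())
```

If the LAN path has been silent for ten seconds but a heartbeat arrived over a serial link two seconds ago, the function returns False with a five-second timeout: the failure is in the network, not in the peer system, and no failover should be attempted.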

When Is Data Replication Appropriate?

Despite the challenges described above, there are a number of scenarios and factors that can justify the use of data replication in a HA cluster. If the amount of shared application data is very small, or if the shared application data is primarily read-only, many of the described problems are minimized and data replication may be quite acceptable.

Data replication may also be considered if cost is a major factor. Shared storage devices can be quite expensive, especially in a large deployment of HA clusters such as in a retail environment with numerous identical sites. The administration of shared storage can also be costly, requiring additional IT staff with the expertise necessary to deal with the increased complexity of shared storage issues. Data replication can be an effective means of avoiding these additional costs when implementing a HA cluster.

If the requirements for a given application or environment already call for multiple copies of the application data, perhaps for backup or other purposes, data replication can accomplish both that goal and the shared data access requirement for high availability.

Shared storage typically imposes certain distance limitations between the systems and the external storage system. Data replication may be the only available solution for clustering when the distance between systems precludes the use of shared storage.

Disaster Recovery Solutions

Taking the distance issue above to the extreme, disaster recovery solutions typically involve the movement of data across large distances. Traditional disaster recovery solutions involve replicating the most critical components of a company's data center at a remote location. Systems and business-critical applications are available in a standby mode at the remote site ready to come into service if needed. Backups of company data are made each night and shipped either to the remote site or to some secure off-site storage facility where they can be retrieved if needed. Optionally, business-critical data can be replicated in real-time to the remote site through the use of data replication techniques like those discussed above.

Like disaster recovery solutions, HA clustering products aim to provide continued access to applications and data in the face of certain abnormal conditions. Traditional HA clustering is generally focused on conditions such as the failures of individual systems, applications, or application resources, not complete site failures. But the concepts and mechanisms of HA clustering, if combined with the right technology, can be an alternative solution to the disaster recovery problem. By combining clustering technology for high availability with data replication technology for disaster recovery, it is possible to achieve an answer to both problems with a single well-integrated solution.

Imagine a 2-node active/passive cluster consisting of systems A and B, using shared storage. System A is the primary system for a single application, and system B is configured as the backup. The application data is stored on the shared storage array. But suppose that the shared application data could also be replicated to a remote system C, using data replication software running on the primary system A. The result is a 3-node HA cluster (see Figure 3), consisting of two local systems with shared storage, and a third remote system receiving data updates via data replication over a wide area network (WAN).

In this example cluster, if system A fails and system B remains healthy, the HA clustering software will migrate the protected application to system B. This is normal behavior for a 2-node HA cluster. In this case, however, the data replication will also be reconfigured such that system B sends all application data updates to system C.

The value in combining HA clustering with data replication to a remote site becomes clear when you consider the question, "What happens if system B fails before system A can be restored to service?" Or even more significant, "What happens if both system A and system B fail simultaneously due to a site disaster?"

Because system C is included in the cluster as defined by the HA software, and because it has access to the application data, the protected application can be migrated to system C with little or no interruption in service. This automatically managed application migration is significantly faster and easier than the process involved in a traditional disaster recovery solution. The return to normal service is also faster and easier, because the HA software will automatically handle the reversal of the data replication from system C back to either system A or B when they are returned to service. It will also allow the administrator to perform an easy migration of the application back to one of those local systems when desired.

Disaster Recovery Technical Requirements

To support such a configuration, a number of technical requirements must be met by the HA clustering software and data replication mechanisms that make up the overall solution. These requirements are described below.

Asynchronous Replication

If the synchronous mode of replication were used in a disaster recovery solution, the additional network latency of the WAN usually would make the impact on the write performance of the application intolerable. So for these solutions, data replication should be done asynchronously, where the data is written immediately on the local system and queued for transmission to the remote backup system as network bandwidth allows.
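A back-of-the-envelope model shows why. The Python sketch below assumes that the local and remote writes are issued in parallel, so a synchronous write takes as long as the slower of the two. The function names are hypothetical, and the latency figures in the note that follows are illustrative assumptions rather than measurements.

```python
def sync_write_latency_ms(local_ms: float, rtt_ms: float, remote_ms: float) -> float:
    """Latency seen by the application for one synchronous replicated write.

    Assumes the local and remote writes start together, so the application
    waits for whichever finishes last. The remote write costs one network
    round trip plus the write time on the target system.
    """
    return max(local_ms, rtt_ms + remote_ms)


def async_write_latency_ms(local_ms: float) -> float:
    """In asynchronous mode the application waits only for the local write."""
    return local_ms


def max_serial_writes_per_sec(latency_ms: float) -> float:
    """Upper bound for an application that issues one write at a time."""
    return 1000.0 / latency_ms
```

With an assumed 1 ms disk write on each system, a 0.5 ms LAN round trip yields a 1.5 ms synchronous write, while a 60 ms WAN round trip yields 61 ms. An application issuing one write at a time would then be limited to roughly 16 writes per second over the WAN, compared with 1000 per second in asynchronous mode.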

Fast Resynchronization

The time required to perform data resynchronization by copying the complete volume of data to the target system would be completely unacceptable in the WAN environment. So a fast resynchronization mechanism, using either transaction logging or intent logging, is a necessity in a disaster recovery solution. These techniques allow data resynchronization to be accomplished by transmitting a much smaller portion of the data over the network.
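The scale of the difference is easy to estimate. The short Python calculation below computes transfer time from data size and link speed. The function is a hypothetical helper, it ignores protocol overhead unless an efficiency factor is supplied, and the volume size, change rate, and link speed used in the note are assumed figures.

```python
def transfer_hours(num_bytes: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to send num_bytes over a link of link_mbps megabits/second.

    efficiency is the fraction of the raw link speed actually achieved
    (1.0 means no protocol overhead, which is optimistic).
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = num_bytes * 8
    seconds = bits / (link_mbps * 1_000_000 * efficiency)
    return seconds / 3600
```

For an assumed 500 GB volume and a 10 Mbit/s WAN link, a full copy takes about 111 hours, or more than four and a half days. If only 2 percent of the blocks (10 GB) changed during the outage, resynchronizing just those blocks takes a little over 2 hours on the same link.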

Network Address Migration Across the WAN

A significant component of the migration of an application from one system to another in a HA cluster is the migration of the network address associated with the application. This allows clients to continue to connect to the application, often in a completely seamless manner. The migration of a network address is accomplished rather easily in a LAN, in which both the primary and backup systems are on the same subnetwork. But this is typically not the case in a disaster recovery solution, in which the primary and backup systems are on physically disjoint networks, making it impossible to move a network address to a subnetwork in which it does not belong.

Recent advances in network technology, however, have made this problem solvable. Virtual LAN (VLAN) technology makes it possible to create a logical subnetwork that spans multiple physical networks. With the right networking hardware, VLANs can be established across wide area connections between routers, even routers connected via Virtual Private Networks (VPNs). As long as the mechanism used by the HA clustering software to force updates of network address to hardware address mapping tables works properly in the VLAN environment, this problem can be solved without additional software features.

Integration of Shared Storage Cluster with Data Replication

Not all HA clustering software products can implement the combination of a local shared storage cluster and a remote system updated via data replication. Such a configuration requires two capabilities that not all HA clustering products include. Most fundamentally, the product must be able to support clusters of more than two systems. The product must also be able to support the combination of shared storage and data replication for a single protected application. Both of these features have significant implications for the basic architecture of the HA software.

Summary

Data replication is an attractive alternative for providing application data access on multiple systems in HA clusters. It is particularly useful in environments where its potential drawbacks are minimized by the nature of the applications or data to be protected, or where the cost savings over a shared storage solution outweigh other negatives. When extended to the problem of disaster recovery, the combination of HA clustering software and data replication techniques can be used to build a solution that far surpasses traditional disaster recovery solutions.

Nelson Yount is one of the original employees of SteelEye Technology, Inc. (http://www.steeleye.com) and serves as the Chief Technology Officer to provide vision and guidance to the company on product technological advancements, as well as define the architectural direction for SteelEye LifeKeeper business continuity solutions. For more than 17 years, Nelson has been designing and implementing clustering and systems management solutions for Windows, Linux, and Unix environments.