High Availability Clustering with Veritas Cluster Server
Paul Guglielmino
When one of your systems goes down, don't you wish that your users
wouldn't notice? With the aid of highly available clusters, your
users can continue working as if nothing had happened, and the systems
administrator can deal with the problem undisturbed. This article
provides an introduction to basic clustering concepts with Veritas
Cluster Server (VCS) using Sun Solaris as the sample platform. VCS
can run on a wide variety of Unix and Windows systems, and the principles
are the same on all of them, but the implementation may differ slightly
from what is presented here. Knowledge of Veritas Volume Manager
is not required to benefit from this article, but it may be helpful
since VCS and Volume Manager are tightly integrated.
Introduction
In its "Cluster Server 3.5 User's Guide (Solaris)", Veritas defines
a cluster as "multiple systems connected with a dedicated communications
infrastructure". In practice, this means a set of servers with a
shared storage infrastructure that act together to provide a service
or set of services. VCS clusters can have from 1 to 32 servers,
or nodes, and can run a variety of applications from databases to
Web servers.
There are several design models for a cluster. Basic designs are
active/passive, active/active, and N-to-1. An active/passive cluster
consists of two servers -- one running your application and one
sitting idle waiting for a failover event. An active/active cluster
has portions of services running on both nodes with either node
capable of taking over all services in a failover event. On the
other hand, an N-to-1 configuration allows multiple servers to be
backed up by just one server. This configuration provides the 100%
redundancy of the active/passive model without the hardware costs.
Beyond the basic design models exist N+1 and N-to-N. An N+1 configuration
has one more server than is needed. Any service can run
on any of the cluster nodes. The N-to-N configuration is the active/passive
model on a larger scale. Multiple servers each have one failover
partner. In practice, active/passive, active/active, and N+1 are
the most common design methods depending on the applications involved.
Here is some important terminology and how it relates to VCS:
Heartbeat: Heartbeats are a communication mechanism for
nodes to exchange information concerning hardware and software status,
keep track of cluster membership, and keep this information synchronized
across all cluster nodes. The heartbeats can be passed over a shared
disk or over a network link. If a shared disk is used, there is
a limit of eight nodes in the cluster, and you need to plan for
possible I/O contention if that disk is used for other applications.
Resource: A resource is an entity that may be brought online,
offline, or monitored on a particular system. Each separate resource
is of a resource type, much like the relationship between a class
and its instance in object-oriented programming. The manner in which
something is brought online, offline, or monitored is a characteristic
of the resource type. Examples of resource types are mount points,
IP addresses, and software processes.
There are three categories of VCS resources: on-off, on-only,
and persistent. On-off means VCS can fully control the resource;
on-only is a resource that VCS can start but not shut down; and
a persistent resource is something that VCS will not control, just
monitor. A good example of a persistent resource is a network card.
A NIC cannot be started or stopped, but VCS will require it to configure
IP addresses.
Resource agent: A resource agent, or just agent, is a collection
of executables that manage a resource type. An agent must be able
to online, offline, monitor, and clean a resource. These operations
are called the "entry points." The online function brings a resource
up on a system, while the monitor function reports whether a resource
is online or offline. The offline function will shut down a resource;
clean is used if the offline procedure did not shut down the resource
completely. Veritas bundles several basic agents with VCS. Enterprise
agents for commercial software -- such as Oracle and Sybase -- are
available from Veritas for an additional fee. You can also build
your own agents as long as they conform to the agent guidelines.
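To give a feel for what an agent entry point involves, here is a minimal, hypothetical monitor script for a custom resource type; the daemon name and paths are placeholders, and the exit codes follow the VCS convention of 110 for online and 100 for offline (script-based agents keep their online, offline, monitor, and clean scripts in a directory named after the resource type under /opt/VRTSvcs/bin):
#!/bin/sh
# Hypothetical monitor entry point for a custom "Widget" resource type.
# Exit 110 to report the resource online, 100 to report it offline.
if pgrep -f widgetd > /dev/null 2>&1; then
    exit 110    # widget daemon found -- online
else
    exit 100    # widget daemon not found -- offline
fi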
Service group: A service group is a logical collection
of resources. These resources will be taken online and offline together.
Service groups come in two varieties -- failover and parallel. Parallel
groups require that the applications within them allow concurrent
access to the same data from multiple nodes. A failover service group can only be online
on one system at a time. Three policies govern the startup and failover
behavior of service groups: priority, round robin, and load.
Priority is the default; you set a priority with each system in
the SystemList attribute of the service group. Round robin puts
the service group on the system hosting the fewest service
groups. The load policy is the most powerful and flexible. It selects
the system with the lowest load as calculated by system capacity
and service group load.
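For example, switching a service group from the default priority policy to the load policy is a matter of setting the group's FailOverPolicy and Load attributes and each system's Capacity attribute; the names and values below are placeholders used only for illustration:
# hagrp -modify my_grp FailOverPolicy Load
# hagrp -modify my_grp Load 100
# hasys -modify systemA Capacity 300
# hasys -modify systemB Capacity 300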
Dependency: A dependency relationship tells the cluster
in what order to bring resource entities online and offline. In
each resource dependency relationship there is a parent and a child.
A parent resource will not be brought online until all of its children
are online. Conversely, the parents are taken offline before the
children. If the dependencies are not defined, then VCS will attempt
to bring up all resources in parallel. This could cause a situation
where a mount point may attempt to go online before the necessary
volumes are ready. Service groups may also have dependencies on other
service groups; that relationship works much the same way, but
group dependencies can span systems.
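For illustration, a resource dependency is created with hares -link, naming the parent first and the child second, and a service group dependency with hagrp -link; the group command also takes dependency-type arguments (such as online local), whose exact form varies between VCS versions, so check the manual page for your release. The names below are placeholders:
# hares -link parent_resource child_resource
# hagrp -link parent_group child_group online local firm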
Split brain: Split brain occurs when two or more systems
within the cluster think they have exclusive access to a shared
resource at the same time. This can be very damaging because data
corruption is common in this situation.
Jeopardy: A system is in jeopardy when only one of its
heartbeat connections is still functioning. If that remaining
heartbeat link is then lost, VCS cannot tell whether the host has
crashed or only the last heartbeat network has failed.
VCS Communication
Heartbeat communication takes place with Group Atomic Broadcast
(GAB) and Low Latency Transport (LLT). These are additional software
packages and kernel modules included with VCS. GAB runs over LLT
and is analogous to UDP running over IP. LLT links are customarily
run over private networks, either via Ethernet crossover cables
or separate network switches. LLT also has the concept of a low-priority
link.
This link is a backup to the normal communication channels and
is not fully utilized unless the other connections are disabled.
Typically this link is run over a normal Ethernet network and is not
segregated the way the primary links are. If you have a backup or administrative
network, that would be a good choice for your low-priority network.
VCS will require at least two separate heartbeat communication channels
unless overridden. Both LLT and GAB need to be configured and running
on all cluster systems before VCS can be started on the cluster.
Communication between the various components of VCS is managed
by the high-availability daemon, also known as "had." "Had" exchanges
information between the userspace components (e.g., resource agents
and CLI tools) and the kernel space components (LLT and GAB). Working
alongside "had" is a process called "hashadow", whose job it is
to monitor the "had" process and restart it if necessary.
VCS keeps numerous log files for debugging and monitoring in /var/VRTSvcs/log.
The primary log is from "had", called the engine log. Additionally,
each resource agent maintains its own log. The "halog" utility can
be used to display the contents of the engine log.
To send alert messages via email or SNMP, VCS includes a notifier
component that interfaces with "had". Along with the notifier, VCS
can take a defined action in response to particular events. These
are called "event triggers" and act similarly to database triggers.
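As a sketch of how a trigger can be used, the following hypothetical resfault trigger simply records and mails whatever arguments VCS passes to it; triggers are ordinary scripts placed in the VCS triggers directory (typically /opt/VRTSvcs/bin/triggers), and the arguments supplied to each trigger are documented in the User's Guide:
#!/bin/sh
# Hypothetical resfault event trigger: log the event and notify root.
# VCS passes details of the fault (system, resource, and so on) as arguments.
LOG=/var/VRTSvcs/log/resfault_trigger.log
echo "`date`: resfault trigger invoked: $*" >> $LOG
echo "VCS resource fault: $*" | mailx -s "VCS resfault trigger" root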
Configuration
There are several tools to view and modify VCS. These include
several command-line utilities and a GUI tool called Cluster Manager,
which comes with a Java console and a Web-based front-end. As with
most things on Unix, it is best to understand how to use all the
command-line utilities and not rely only on the GUI tools.
There are two ways to access the Veritas tools to view or modify
the cluster: use a user account defined in VCS, or have
root access on one of the cluster systems. Access to VCS is restricted
based on several user categories within VCS. They are cluster administrator,
cluster operator, group administrator, group operator, and cluster
guest. Each category has all the privileges of the lower categories.
For example, group administrator can do all the functions of a group
operator and cluster guest. Users with root access can bypass VCS
authorization and run any of the command-line utilities with cluster
administrator privileges. New users by default are in the cluster
guest category until explicitly put into one or more of the other
categories. Broadly speaking, guest users can only view the state
of things; operators can view and change the state of things but
not modify the configuration; administrators can do anything.
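For example, to make an existing VCS user an operator of a single service group rather than of the whole cluster (the user and group names below are placeholders; the haconf commands open and then save the configuration, as explained in the next section):
# haconf -makerw
# hagrp -modify my_grp Operators -add jsmith
# haconf -dump -makero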
Configuration for VCS is stored in two files with similar formats
-- main.cf and types.cf -- both located in /etc/VRTSvcs/conf/config.
The types.cf file holds information about each resource type. The
main.cf holds information specific to the cluster -- users, resources,
service groups, and dependencies. Changes made to the main.cf are
performed in memory and not saved to disk until a configuration
dump is performed.
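To give a feel for the format, here is a heavily trimmed, illustrative main.cf fragment; the names are placeholders, most resources and attributes are omitted, and the full grammar is described in the User's Guide:
include "types.cf"

cluster widget_cluster (
        UserNames = { clusadmin = <encrypted password> }
        Administrators = { clusadmin }
        )

system systemA (
        )

system systemB (
        )

group my_grp (
        SystemList = { systemA = 1, systemB = 2 }
        )

        Mount my_mount (
                MountPoint = "/clustermount"
                FSType = vxfs
                )

        my_mount requires my_volume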
Configuration for LLT and GAB is held in the /etc/llttab and
/etc/gabtab files. Detailed sample files are supplied by Veritas
in their respective installation directories. There are many configuration
options for LLT, but the minimum needed to operate are the
node id, cluster number, and network links to be used for communication.
Only nodes with the same cluster number will be able to communicate
with each other, and each node in the cluster must have a unique
node id. For GAB, the only required configuration option is to list
the number of nodes in the cluster.
Building a Sample Cluster
Putting all this together, let's see a small, two-node cluster
in action. The sample hardware will be a pair of Sun v240 servers
attached to EMC storage running a custom widget application. Our
sample cluster will be fully redundant to the host level and there
should be no single point of failure (SPOF). This design follows Veritas
best practices for building a cluster. The v240 servers have four
network ports built-in (bge ports) and three PCI card slots. We
will install two PCI Fibre cards for redundant connection to the
storage. The last PCI slot is for a quad Ethernet card (qfe ports)
to back up the four internal Ethernet ports. There will be one failover
service group for our application. Some common applications used
in this type of cluster environment are Oracle, IBM MQSeries, and
NFS servers.
There will be two heartbeat communication links in addition to
a low-priority link and two paths to the public network. The host
will have mirrored root disks and redundant power supplies. The
first steps are to install all the hardware, set up Solaris, mirror
the root disks, and configure the storage to be visible to both
servers. I will assume you have created just one Veritas diskgroup
and volume for this cluster. The diskgroup should be deported before
you start configuring the cluster.
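If you need a starting point, the Volume Manager steps might look roughly like the following; the disk, diskgroup, and volume names and the size are placeholders, and the disks are assumed to already be initialized for Volume Manager use:
# vxdg init widget_dg widget_disk01=c2t0d0
# vxassist -g widget_dg make widget_vol 10g
# mkfs -F vxfs /dev/vx/rdsk/widget_dg/widget_vol
# vxdg deport widget_dg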
The network should be run from Ethernet port bge0 and the backup
cable to the quad card port qfe0. Run crossover Ethernet cables
for the heartbeats using ports bge1 and qfe1 on each server. Use
the next bge port for an administrative network connection. Install
VCS via the install script supplied by Veritas. The script will ask for
your license keys; otherwise, you would have to install them manually
with the halic command. Next, set up your gabtab and llttab files.
Your gabtab file should look like:
/sbin/gabconfig -c -n2
In your llttab file, look for the "set-node", "set-cluster", and "link"
lines. For our cluster, your file should look like:
set-node 0 (the other node would be set-node 1)
set-cluster 1
link bge1 /dev/bge:1 - ether - -
link qfe1 /dev/qfe:1 - ether - -
link-lowpri bge2 /dev/bge:2 - ether - -
Once the files are complete, start LLT and then GAB via their init.d
startup scripts. You can confirm the hosts see each other with lltstat
and gabconfig. Here is sample lltstat -n output from
a working cluster:
LLT node information:
    Node            State    Links
  * 0 systemA       OPEN     3
    1 systemB       OPEN     3
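Similarly, gabconfig -a should show a GAB membership containing both nodes. Before VCS itself is started, you would expect output along these lines, with only port a (GAB) present; once "had" is running, a second line for port h (VCS) appears. The generation number will differ on your systems:
GAB Port Memberships
===============================================================
Port a gen   a36e0003 membership 01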
Once they are working correctly, you can start VCS from its init.d
script. We can then use hastatus -summary to confirm that VCS
is running on all systems. We are now ready to configure VCS. Start
by making the cluster configuration file writeable. Then we can add
a user who will be an administrator for the entire cluster. This user
can be used to access the Cluster Manager GUI:
# haconf -makerw
# hauser -add clusadmin (This will ask you to set a password for the account)
# haclus -modify Administrators -add clusadmin
Once that is done, you can add the system names to the cluster. Then
add the service group and define the systems on which it can be run.
The numbers you see are the priority for that system:
# hasys -add systemA
# hasys -add systemB
# hagrp -add my_grp
# hagrp -modify my_grp SystemList -add systemA 1
# hagrp -modify my_grp SystemList -add systemB 2
At this point, we will create our resources. Most resources have various
attributes that can be set. For this sample, we will only change the
required attributes, but you should examine the bundled agents' reference
guide to see all configurable settings. Create the diskgroup, volume,
and mount resources and modify their attributes. Then link these resource
dependencies that come online in this order: diskgroup, volume, and
mount point:
# hares -add my_diskgroup DiskGroup my_grp
# hares -modify my_diskgroup DiskGroup veritas_diskgroup_name
# hares -add my_volume Volume my_grp
# hares -modify my_volume Volume veritas_volume_name
# hares -modify my_volume DiskGroup veritas_diskgroup_name
# hares -add my_mount Mount my_grp
# hares -modify my_mount MountPoint /clustermount
# hares -modify my_mount FSType vxfs
# hares -modify my_mount Fsckopt %-y
# hares -modify my_mount BlockDevice /dev/vx/dsk/veritas_diskgroup_name/veritas_volume_name
# hares -link my_volume my_diskgroup
# hares -link my_mount my_volume
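At this point, it is worth double-checking the work so far. The hares -display command shows a resource's attribute values, and hares -dep lists its parent and child links, for example:
# hares -display my_mount
# hares -dep my_mount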
Now we can add the network resources. The MultiNICA resource will
control network failover between the bge0 and qfe0 ports on our sample
servers. Here we specify the local bge0 and qfe0 ports and which IP
addresses to assign to them. This does not take the place of Solaris
assigning IP addresses at boot time; VCS is merely monitoring the
status of the network links. The IPMultiNIC resource is the virtual
IP for the cluster with which clients will communicate:
# hares -add my_multinic MultiNICA my_grp
# hares -local my_multinic Device
# hares -modify my_multinic Device bge0 192.168.0.1 -sys systemA
# hares -modify my_multinic Device qfe0 192.168.0.1 -sys systemA
# hares -modify my_multinic Device bge0 192.168.0.2 -sys systemB
# hares -modify my_multinic Device qfe0 192.168.0.2 -sys systemB
# hares -add my_ipaddress IPMultiNIC my_grp
# hares -modify my_ipaddress Address 192.168.0.3
# hares -modify my_ipaddress MultiNICResName my_multinic
The final part of the process is the most important. Here we
will add the application resource. Typically, the application requires
the mount point and the IP address to be online before it can start,
so we will make those dependencies. Figure 1 shows the dependency
layout in the Cluster Manager GUI. The Application agent can monitor
your application in several different ways; here we simply point it at
a PID file. Next, we enable all the resources in the service group,
because VCS will not attempt to online, offline, or monitor any
resources unless they are enabled. Finally,
we will dump the running configuration from memory to disk and make
the file read-only:
# hares -add my_application Application my_grp
# hares -modify my_application PidFiles /path/to/pidfile
# hares -modify my_application StartProgram /path/to/startup/script
# hares -modify my_application StopProgram /path/to/shutdown/script
# hares -link my_application my_ipaddress
# hares -link my_application my_mount
# hagrp -enableresources my_grp
# haconf -dump -makero
Now we are ready to test everything. A simple online command will start
the service group on one of the systems. If that works, you
can try to switch the group over to the other node:
# hagrp -online my_grp -sys systemA
# hagrp -switch my_grp -to systemB
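Between steps, you can confirm where the group is running with hastatus -summary, or query the group state directly:
# hagrp -state my_grp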
Additional valuable tests include:
- Kill the "had" process and check that "hashadow" restarts it.
- Panic an active system so the service group will fail over
to another node. This also tests whether the application
can survive such an event (see the recovery commands below).
- Pull heartbeat and network cables and make sure everything
reacts as expected.
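After a destructive test such as the panic, the service group may be left marked FAULTED on the node that went down. Once that node is healthy again, the fault can be cleared and the group brought back; the group and system names below match our sample cluster:
# hagrp -clear my_grp -sys systemA
# hagrp -online my_grp -sys systemA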
Conclusion
New in VCS 4.0 is a cluster simulator that allows you to test
cluster configurations from a Windows PC. The simulator is also
available for download from Veritas and includes some sample configurations.
Although it is beyond the scope of this article to describe its
setup, the simulator is a wonderful tool with which to learn more
about VCS setup and to test changes to your existing cluster configurations.
The simulator also allows you to fault a system or resource to see
the effects on the cluster.
I have described the basics of VCS and how to set up a simple
and useful two-node cluster. To further your VCS knowledge, I suggest
reading through the VCS User's Guide (a free download from http://www.veritas.com).
You can also start practicing with the sample configurations in
the simulator. Other cluster software is also available, such as Sun
Cluster from Sun (http://www.sun.com) and Linux-HA from the open
source community (http://www.linux-ha.org).
References
"Veritas Cluster Server 3.5 User's Guide (Solaris)," 2002 from
Veritas -- http://www.veritas.com
Paul Guglielmino (paulg@nepd.com) is a Unix consultant
in the Boston area. He has spent the past three years designing,
building, and troubleshooting Veritas clusters. |