High Availability Clustering with Veritas Cluster Server
Paul Guglielmino
When one of your systems goes down, don't you wish that your users
wouldn't notice? With the aid of highly available clusters, your
users can continue working as if nothing had happened, and the systems
administrator can deal with the problem undisturbed. This article
provides an introduction to basic clustering concepts with Veritas
Cluster Server (VCS) using Sun Solaris as the sample platform. VCS
can run on a wide variety of Unix and Windows systems, and the principles
are the same on all of them, but the implementation may differ slightly
from what is presented here. Knowledge of Veritas Volume Manager
is not required to benefit from this article, but it may be helpful
since VCS and Volume Manager are tightly integrated.
Introduction
In its "Cluster Server 3.5 User's Guide (Solaris)", Veritas defines
a cluster as "multiple systems connected with a dedicated communications
infrastructure". In practice, this means a set of servers with a
shared storage infrastructure that act together to provide a service
or set of services. VCS clusters can have from 1 to 32 servers,
or nodes, and can run a variety of applications from databases to
Web servers.
There are several design models for a cluster. Basic designs are
active/passive, active/active, and N-to-1. An active/passive cluster
consists of two servers -- one running your application and one
sitting idle waiting for a failover event. An active/active cluster
has portions of services running on both nodes with either node
capable of taking over all services in a failover event. On the
other hand, an N-to-1 configuration allows multiple servers to be
backed up by just one server. This configuration provides the 100%
redundancy of the active/passive model without the hardware costs.
Beyond the basic design models exist N+1 and N-to-N. An N+1 configuration
has one more server than is needed. Any service can run
on any of the cluster nodes. The N-to-N configuration is the active/passive
model on a larger scale. Multiple servers each have one failover
partner. In practice, active/passive, active/active, and N+1 are
the most common design methods depending on the applications involved.
Here is some important terminology and how it relates to VCS:
Heartbeat: Heartbeats are a communication mechanism for
nodes to exchange information concerning hardware and software status,
keep track of cluster membership, and keep this information synchronized
across all cluster nodes. The heartbeats can be passed over a shared
disk or over a network link. If a shared disk is used, there is
a limit of eight nodes in the cluster, and you need to plan for
possible I/O contention if that disk is used for other applications.
Resource: A resource is an entity that may be brought online,
offline, or monitored on a particular system. Each separate resource
is of a resource type, much like the relationship between a class
and its instance in object-oriented programming. The manner in which
something is brought online, offline, or monitored is a characteristic
of the resource type. Examples of resource types are mount points,
IP addresses, and software processes.
There are three categories of VCS resources: on-off, on-only,
and persistent. On-off means VCS can fully control the resource;
on-only is a resource that VCS can start but not shut down; and
a persistent resource is something that VCS will not control, just
monitor. A good example of a persistent resource is a network card.
A NIC cannot be started or stopped, but VCS will require it to configure
IP addresses.
Resource agent: A resource agent, or just agent, is a collection
of executables that manage a resource type. An agent must be able
to online, offline, monitor, and clean a resource. These operations
are called the "entry points." The online function brings a resource
up on a system, while the monitor function reports whether a resource
is online or offline. The offline function will shut down a resource;
clean is used if the offline procedure did not shut down the resource
completely. Veritas bundles several basic agents with VCS. Enterprise
agents for commercial software -- such as Oracle and Sybase -- are
available from Veritas for an additional fee. You can also build
your own agents as long as they conform to the agent guidelines.
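To give a feel for what an agent entry point involves, here is a minimal, hypothetical monitor script for a custom resource type; the daemon name and paths are placeholders, and the exit codes follow the VCS convention of 110 for online and 100 for offline (script-based agents keep their online, offline, monitor, and clean scripts in a directory named after the resource type under /opt/VRTSvcs/bin):
#!/bin/sh
# Hypothetical monitor entry point for a custom "Widget" resource type.
# Exit 110 to report the resource online, 100 to report it offline.
if pgrep -f widgetd > /dev/null 2>&1; then
    exit 110    # widget daemon found -- online
else
    exit 100    # widget daemon not found -- offline
fi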
Service group: A service group is a logical collection
of resources. These resources will be taken online and offline together.
Service groups come in two varieties -- failover and parallel. Parallel
groups require that the applications within them allow concurrent
access to the same data from multiple nodes. A failover service group can only be online
on one system at a time. Three policies govern the startup and failover
behavior of service groups: priority, round robin, and load.
Priority is the default; you set a priority with each system in
the SystemList attribute of the service group. Round robin puts
the service group on the system hosting the fewest service
groups. The load policy is the most powerful and flexible. It selects
the system with the lowest load as calculated by system capacity
and service group load.
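For example, switching a service group from the default priority policy to the load policy is a matter of setting the group's FailOverPolicy and Load attributes and each system's Capacity attribute; the names and values below are placeholders used only for illustration:
# hagrp -modify my_grp FailOverPolicy Load
# hagrp -modify my_grp Load 100
# hasys -modify systemA Capacity 300
# hasys -modify systemB Capacity 300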
Dependency: A dependency relationship tells the cluster
in what order to bring resource entities online and offline. In
each resource dependency relationship there is a parent and a child.
A parent resource will not be brought online until all of its children
are online. Conversely, the parents are taken offline before the
children. If the dependencies are not defined, then VCS will attempt
to bring up all resources in parallel. This could cause a situation
where a mount point may attempt to go online before the necessary
volumes are ready. Service groups may also have dependencies on other
service groups; that relationship works much the same way, but
group dependencies can span systems.
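For illustration, a resource dependency is created with hares -link, naming the parent first and the child second, and a service group dependency with hagrp -link; the group command also takes dependency-type arguments (such as online local), whose exact form varies between VCS versions, so check the manual page for your release. The names below are placeholders:
# hares -link parent_resource child_resource
# hagrp -link parent_group child_group online local firm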
Split brain: Split brain occurs when two or more systems
within the cluster think they have exclusive access to a shared
resource at the same time. This can be very damaging because data
corruption is common in this situation.
Jeopardy: A system is in jeopardy when only one of its
heartbeat connections is still functioning. If that remaining
heartbeat link is then lost, VCS cannot tell whether the host has
crashed or only the last heartbeat network has failed.
VCS Communication
Heartbeat communication takes place with Group Atomic Broadcast
(GAB) and Low Latency Transport (LLT). These are additional software
packages and kernel modules included with VCS. GAB runs over LLT
and is analogous to UDP running over IP. LLT links are customarily
run over private networks, either via Ethernet crossover cables
or separate network switches. LLT also has the concept of a low-priority
link.
This link is a backup to the normal communication channels and
is not fully utilized unless the other connections are disabled.
Typically this link is run over a normal Ethernet network and is not
segregated the way the primary links are. If you have a backup or administrative
network, that would be a good choice for your low-priority network.
VCS will require at least two separate heartbeat communication channels
unless overridden. Both LLT and GAB need to be configured and running
on all cluster systems before VCS can be started on the cluster.
Communication between the various components of VCS is managed
by the high-availability daemon, also known as "had." "Had" exchanges
information between the userspace components (e.g., resource agents
and CLI tools) and the kernel space components (LLT and GAB). Working
alongside "had" is a process called "hashadow", whose job it is
to monitor the "had" process and restart it if necessary.
VCS keeps numerous log files for debugging and monitoring in /var/VRTSvcs/log.
The primary log is from "had", called the engine log. Additionally,
each resource agent maintains its own log. The "halog" utility can
be used to display the contents of the engine log.
To send alert messages via email or SNMP, VCS includes a notifier
component that interfaces with "had". Along with the notifier, VCS
can take a defined action in response to particular events. These
are called "event triggers" and act similarly to database triggers.
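As a sketch of how a trigger can be used, the following hypothetical resfault trigger simply records and mails whatever arguments VCS passes to it; triggers are ordinary scripts placed in the VCS triggers directory (typically /opt/VRTSvcs/bin/triggers), and the arguments supplied to each trigger are documented in the User's Guide:
#!/bin/sh
# Hypothetical resfault event trigger: log the event and notify root.
# VCS passes details of the fault (system, resource, and so on) as arguments.
LOG=/var/VRTSvcs/log/resfault_trigger.log
echo "`date`: resfault trigger invoked: $*" >> $LOG
echo "VCS resource fault: $*" | mailx -s "VCS resfault trigger" root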
Configuration
There are several tools to view and modify VCS. These include
several command-line utilities and a GUI tool called Cluster Manager,
which comes with a Java console and a Web-based front-end. As with
most things on Unix, it is best to understand how to use all the
command-line utilities and not rely only on the GUI tools.
There are two ways to access the Veritas tools to view or modify
the cluster: use a user account defined in VCS, or have
root access on one of the cluster systems. Access to VCS is restricted
based on several user categories within VCS. They are cluster administrator,
cluster operator, group administrator, group operator, and cluster
guest. Each category has all the privileges of the lower categories.
For example, group administrator can do all the functions of a group
operator and cluster guest. Users with root access can bypass VCS
authorization and run any of the command-line utilities with cluster
administrator privileges. New users by default are in the cluster
guest category until explicitly put into one or more of the other
categories. Broadly speaking, guest users can only view the state
of things; operators can view and change the state of things but
not modify the configuration; administrators can do anything.
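For example, to make an existing VCS user an operator of a single service group rather than of the whole cluster (the user and group names below are placeholders; the haconf commands open and then save the configuration, as explained in the next section):
# haconf -makerw
# hagrp -modify my_grp Operators -add jsmith
# haconf -dump -makero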
Configuration for VCS is stored in two files with similar formats
-- main.cf and types.cf -- both located in /etc/VRTSvcs/conf/config.
The types.cf file holds information about each resource type. The
main.cf holds information specific to the cluster -- users, resources,
service groups, and dependencies. Changes made to the main.cf are
performed in memory and not saved to disk until a configuration
dump is performed.
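To give a feel for the format, here is a heavily trimmed, illustrative main.cf fragment; the names are placeholders, most resources and attributes are omitted, and the full grammar is described in the User's Guide:
include "types.cf"

cluster widget_cluster (
        UserNames = { clusadmin = <encrypted password> }
        Administrators = { clusadmin }
        )

system systemA (
        )

system systemB (
        )

group my_grp (
        SystemList = { systemA = 1, systemB = 2 }
        )

        Mount my_mount (
                MountPoint = "/clustermount"
                FSType = vxfs
                )

        my_mount requires my_volume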
Configuration for LLT and GAB is held in the /etc/llttab and
/etc/gabtab files. Detailed sample files are supplied by Veritas
in their respective installation directories. There are many configuration
options for LLT, but the minimum needed to operate are the
node id, cluster number, and network links to be used for communication.
Only nodes with the same cluster number will be able to communicate
with each other, and each node in the cluster must have a unique
node id. For GAB, the only required configuration option is to list
the number of nodes in the cluster.
Building a Sample Cluster
Putting all this together, let's see a small, two-node cluster
in action. The sample hardware will be a pair of Sun v240 servers
attached to EMC storage running a custom widget application. Our
sample cluster will be fully redundant to the host level and there
should be no single point of failure (SPOF). This design follows Veritas
best practices for building a cluster. The v240 servers have four
network ports built-in (bge ports) and three PCI card slots. We
will install two PCI Fibre cards for redundant connection to the
storage. The last PCI slot is for a quad Ethernet card (qfe ports)
to back up the four internal Ethernet ports. There will be one failover
service group for our application. Some common applications used
in this type of cluster environment are Oracle, IBM MQSeries, and
NFS servers.
There will be two heartbeat communication links in addition to
a low-priority link and two paths to the public network. The host
will have mirrored root disks and redundant power supplies. The
first steps are to install all the hardware, set up Solaris, mirror
the root disks, and configure the storage to be visible to both
servers. I will assume you have created just one Veritas diskgroup
and volume for this cluster. The diskgroup should be deported before
you start configuring the cluster.
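If you need a starting point, the Volume Manager steps might look roughly like the following; the disk, diskgroup, and volume names and the size are placeholders, and the disks are assumed to already be initialized for Volume Manager use:
# vxdg init widget_dg widget_disk01=c2t0d0
# vxassist -g widget_dg make widget_vol 10g
# mkfs -F vxfs /dev/vx/rdsk/widget_dg/widget_vol
# vxdg deport widget_dg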
The network should be run from Ethernet port bge0 and the backup
cable to the quad card port qfe0. Run crossover Ethernet cables
for the heartbeats using ports bge1 and qfe1 on each server. Use
the next bge port for an administrative network connection. Install
VCS via the install script supplied by Veritas. The script will ask for
your license keys; otherwise, you would have to install them manually
with the halic command. Next, set up your gabtab and llttab files.
Your gabtab file should look like:
/sbin/gabconfig -c -n2
In your llttab file, look for the "set-node", "set-cluster", and "link"
lines. For our cluster, your file should look like:
set-node 0 (the other node would be set-node 1)
set-cluster 1
link bge1 /dev/bge:1 - ether - -
link qfe1 /dev/qfe:1 - ether - -
link-lowpri bge2 /dev/bge:2 - ether - -
Once the files are complete, start LLT and then GAB via their init.d
startup scripts. You can confirm the hosts see each other with lltstat
and gabconfig. Here is sample lltstat -n output from
a working cluster:
LLT node information:
    Node            State    Links
  * 0 systemA       OPEN     3
    1 systemB       OPEN     3
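Similarly, gabconfig -a should show a GAB membership containing both nodes. Before VCS itself is started, you would expect output along these lines, with only port a (GAB) present; once "had" is running, a second line for port h (VCS) appears. The generation number will differ on your systems:
GAB Port Memberships
===============================================================
Port a gen   a36e0003 membership 01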
Once they are working correctly, you can start VCS from its init.d
script. We can then use hastatus -summary to confirm that VCS
is running on all systems. We are now ready to configure VCS. Start
by making the cluster configuration file writeable. Then we can add
a user who will be an administrator for the entire cluster. This user
can be used to access the Cluster Manager GUI:
# haconf -makerw
# hauser -add clusadmin (This will ask you to set a password for the account)
# haclus -modify Administrators -add clusadmin
Once that is done, you can add the system names to the cluster. Then
add the service group and define the systems on which it can be run.
The numbers you see are the priority for that system:
# hasys -add systemA
# hasys -add systemB
# hagrp -add my_grp
# hagrp -modify my_grp SystemList -add systemA 1
# hagrp -modify my_grp SystemList -add systemB 2
At this point, we will create our resources. Most resources have various
attributes that can be set. For this sample, we will only change the
required attributes, but you should examine the bundled agents' reference
guide to see all configurable settings. Create the diskgroup, volume,
and mount resources and modify their attributes. Then link these resource
dependencies that come online in this order: diskgroup, volume, and
mount point:
# hares -add my_diskgroup DiskGroup my_grp
# hares -modify my_diskgroup DiskGroup veritas_diskgroup_name
# hares -add my_volume Volume my_grp
# hares -modify my_volume Volume veritas_volume_name
# hares -modify my_volume DiskGroup veritas_diskgroup_name
# hares -add my_mount Mount my_grp
# hares -modify my_mount MountPoint /clustermount
# hares -modify my_mount FSType vxfs
# hares -modify my_mount Fsckopt %-y
# hares -modify my_mount BlockDevice /dev/vx/dsk/veritas_diskgroup_name/veritas_volume_name
# hares -link my_volume my_diskgroup
# hares -link my_mount my_volume
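At this point, it is worth double-checking the work so far. The hares -display command shows a resource's attribute values, and hares -dep lists its parent and child links, for example:
# hares -display my_mount
# hares -dep my_mount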
Now we can add the network resources. The MultiNICA resource will
control network failover between the bge0 and qfe0 ports on our sample
servers. Here we specify the local bge0 and qfe0 ports and which IP
addresses to assign to them. This does not take the place of Solaris
assigning IP addresses at boot time; VCS is merely monitoring the
status of the network links. The IPMultiNIC resource is the virtual
IP for the cluster with which clients will communicate:
# hares -add my_multinic MultiNICA my_grp
# hares -local my_multinic Device
# hares -modify my_multinic Device bge0 192.168.0.1 -sys systemA
# hares -modify my_multinic Device qfe0 192.168.0.1 -sys systemA
# hares -modify my_multinic Device bge0 192.168.0.2 -sys systemB
# hares -modify my_multinic Device qfe0 192.168.0.2 -sys systemB
# hares -add my_ipaddress IPMultiNIC my_grp
# hares -modify my_ipaddress Address 192.168.0.3
# hares -modify my_ipaddress MultiNICResName my_multinic
The final part of the process is the most important. Here we
will add the application resource. Typically, the application requires
the mount point and the IP address to be online before it can start,
so we will make those dependencies. Figure 1 shows the dependency
layout in the Cluster Manager GUI. The Application agent can monitor
your application in several different ways; here we simply point it at
a PID file. Next, we enable all the resources in the service group,
because VCS will not attempt to online, offline, or monitor any
resources unless they are enabled. Finally,
we will dump the running configuration from memory to disk and make
the file read-only:
# hares -add my_application Application my_grp
# hares -modify my_application PidFiles /path/to/pidfile
# hares -modify my_application StartProgram /path/to/startup/script
# hares -modify my_application StopProgram /path/to/shutdown/script
# hares -link my_application my_ipaddress
# hares -link my_application my_mount
# hagrp -enableresources my_grp
# haconf -dump -makero
Now we are ready to test everything. A simple online command will start
the service group on one of the systems. If that works, you
can try to switch the group over to the other node:
# hagrp -online my_grp -sys systemA
# hagrp -switch my_grp -to systemB
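Between steps, you can confirm where the group is running with hastatus -summary, or query the group state directly:
# hagrp -state my_grp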
Additional valuable tests include:
- Kill the "had" process and check that "hashadow" restarts it.
- Panic an active system so the service group will fail over
to another node. This also tests whether the application
can survive such an event (see the recovery commands below).
- Pull heartbeat and network cables and make sure everything
reacts as expected.
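After a destructive test such as the panic, the service group may be left marked FAULTED on the node that went down. Once that node is healthy again, the fault can be cleared and the group brought back; the group and system names below match our sample cluster:
# hagrp -clear my_grp -sys systemA
# hagrp -online my_grp -sys systemA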
Conclusion
New in VCS 4.0 is a cluster simulator that allows you to test
cluster configurations from a Windows PC. The simulator is also
available for download from Veritas and includes some sample configurations.
Although it is beyond the scope of this article to describe its
setup, the simulator is a wonderful tool with which to learn more
about VCS setup and to test changes to your existing cluster configurations.
The simulator also allows you to fault a system or resource to see
the effects on the cluster.
I have described the basics of VCS and how to set up a simple
and useful two-node cluster. To further your VCS knowledge, I suggest
reading through the VCS User's Guide (a free download from http://www.veritas.com).
You can also start practicing with the sample configurations in
the simulator. Other cluster software is also available, such as Sun
Cluster from Sun (http://www.sun.com) and Linux-HA from the open
source community (http://www.linux-ha.org).
References
"Veritas Cluster Server 3.5 User's Guide (Solaris)," 2002 from
Veritas -- http://www.veritas.com
Paul Guglielmino (paulg@nepd.com) is a Unix consultant
in the Boston area. He has spent the past three years designing,
building, and troubleshooting Veritas clusters. |