Cover V12, I05

may2003.tar

Solaris™ Failover -- Keeping Connected at All Times

Brian Gollsneider and Arthur Messenger

Success in today's high-tech world demands high-availability systems. Five or six 9s availability requires high-end hardware, stable operating systems, and a stable connection to the network. In Solaris™ 8, Sun introduced IP Network Multipathing. This capability allows administrators to create a hot standby for a network interface card (NIC) or to configure several active NICs on a machine in a multipath group to back up each other. The hot standby can take over for a failed primary card in as little as 100 ms. In this article, we show how to configure a system for failover, then describe the network impact of multipath groups, how resilient normal network applications are to timeouts, and how the relevant events are logged and reported. We assume a working Solaris 8 system that is on a network and has a second network card available. Ethereal was used to monitor and record the network activity.

Background

Network failover is the ability to recover from a network problem on one network path and switch to another. The failure can be the network card itself dying, the network cable being cut or disconnected, or some other equivalent event. Note that we forced network failures by physically disconnecting the network cable at the appropriate time. Sun's IP Network Multipathing has three main parts: failure detection, repair detection, and outbound load spreading.

Failure detection is sensing when a link is no longer good. On the other hand, repair detection is determining when the link is good again. Outbound load spreading is dividing the network traffic leaving the system between the network interfaces. Earlier releases of Solaris supported multiple network cards but did not have failover. The common approach to simulate failover previously was to write a script that continually pinged a host. If an answer was received, then no action was taken because the network connection was good. If the ping failed, then that interface was brought down and a backup interface was configured and activated. Although this approach worked, it had limitations and was not very elegant.

With Solaris 8, administrators get the ability to configure network failover in several ways. The two primary ways are standby and active. Standby is where the primary network card is used until a failure and then the system switches to the standby card; active is where both cards are active until one fails, at which time all traffic is sent through the remaining card.

Details of IP Network Multipathing

IP Network Multipathing uses a daemon, /sbin/in.mpathd, to watch over a group of NICs. A private address used only by in.mpathd is established on each NIC. The in.mpathd daemon sends echo requests (pings) to a node on the IP link (see footnote 1). The node is the default router if there is one. If there is no router, the node is determined by sending a multicast packet to the "all hosts" multicast address, 224.0.0.1. The first few hosts to reply become the target nodes. In our small test network with nine other hosts, in.mpathd began sending echo requests to five of the nine responding hosts in a random fashion. If there are five consecutive echo request failures on a NIC in the group that in.mpathd is watching, a failure has been detected and the link is declared not to be functioning.

The in.mpathd daemon then creates a logical interface for the failed NIC's IP address on the NIC in the group that has the fewest logical interfaces. IP then starts using this new NIC. The in.mpathd daemon continues to send echo requests on the failed NIC while it is declared non-functioning. When it sees 10 consecutive echo request successes on the failed NIC's private address (i.e., it has detected the repair of the link), in.mpathd re-establishes the IP address on the original NIC and removes the logical interface on the new NIC.

IP is now using the original NIC. What we have called a private address is really a deprecated IP address -- an address that IP will not use unless explicitly told to. These private addresses must be visible to the IP link. This usually implies that the private IP address has the same network address as the echo request responding node. At the least, the echo request responder must have echo response turned on in the IP stack or in.mpathd will have no way of seeing whether the NIC is down.
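The failure and repair thresholds described above (five consecutive failed echo requests to declare a failure, ten consecutive successes to declare a repair) can be pictured as a small state machine. The following is a minimal sketch in Python, not in.mpathd's actual implementation; the class name LinkMonitor and its fields are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

FAILURES_TO_DECLARE_DOWN = 5   # consecutive failed probes (threshold from the text)
SUCCESSES_TO_DECLARE_UP = 10   # consecutive successful probes (threshold from the text)


@dataclass
class LinkMonitor:
    """Toy model of per-NIC failure and repair detection (hypothetical class)."""
    up: bool = True
    fail_streak: int = 0
    ok_streak: int = 0

    def record_probe(self, answered: bool) -> bool:
        """Feed one echo-request result and return the link state afterwards."""
        if answered:
            self.ok_streak += 1
            self.fail_streak = 0
            # A failed link is declared repaired only after a full success streak.
            if not self.up and self.ok_streak >= SUCCESSES_TO_DECLARE_UP:
                self.up = True
        else:
            self.fail_streak += 1
            self.ok_streak = 0
            # A working link is declared failed only after a full failure streak.
            if self.up and self.fail_streak >= FAILURES_TO_DECLARE_DOWN:
                self.up = False
        return self.up


monitor = LinkMonitor()
# Simulate a pulled cable: the link stays "up" until the fifth missed reply.
print([monitor.record_probe(False) for _ in range(5)])
```

A single answered probe resets the failure streak in this model, which is why a brief glitch does not trigger a failover.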

If there is no router on the network, then the echo request responder must at least have the "all hosts" multicast address, 224.0.0.1, active. The file /etc/default/mpathd is created during installation and controls several aspects of in.mpathd's behavior, the most important of which are the detection time for a failure and whether failback is allowed. Listing 1 shows /etc/default/mpathd with the default comments removed.

FAILURE_DETECTION_TIME is set to 10000 milliseconds (10 seconds) by default. This can be dropped as low as 100 ms for applications with time-critical connectivity requirements. This is the time allowed to determine a NIC failure, which is defined as five consecutive echo request failures. The system therefore divides FAILURE_DETECTION_TIME into five approximately equal time segments to do the pings. Of course, smaller values for FAILURE_DETECTION_TIME place a higher load on the network. FAILBACK=yes tells in.mpathd to go back to the original NIC if it determines that the NIC has been repaired. We did not work with TRACK_INTERFACES_ONLY_WITH_GROUPS.

TRACK_INTERFACES_ONLY_WITH_GROUPS=yes is the default. If this is set to no, in.mpathd will report on all failed NICs on the node even if they are not in the multipath group. The network events discussed above are logged in /var/adm/messages, so the system activity tool of your choice (e.g., swatch) can be set up to notify you as necessary. See the References for more detail than is provided in this article.
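To make the relationship between FAILURE_DETECTION_TIME and probe traffic concrete, here is a minimal Python sketch that reads settings in the KEY=value form described above and derives the approximate spacing between echo requests. The sample text is assembled from the default values stated in this article and is not a copy of Listing 1; the function names are hypothetical, and the "#" comment handling is an assumption about the file format. Because the text says only that the time is divided into "approximately equal" segments, the result is an estimate.

```python
PROBES_PER_DETECTION = 5  # five consecutive failures define a NIC failure (from the text)


def parse_mpathd_defaults(text: str) -> dict:
    """Parse KEY=value lines, skipping blanks and '#' comments (assumed syntax)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def probe_interval_ms(settings: dict) -> float:
    """Approximate time between echo requests for the configured detection time."""
    return int(settings["FAILURE_DETECTION_TIME"]) / PROBES_PER_DETECTION


# Sample built from the defaults named in the article, not from Listing 1.
SAMPLE = """\
# hypothetical comment line
FAILURE_DETECTION_TIME=10000
FAILBACK=yes
TRACK_INTERFACES_ONLY_WITH_GROUPS=yes
"""

cfg = parse_mpathd_defaults(SAMPLE)
print(cfg["FAILBACK"], probe_interval_ms(cfg))               # default: one probe about every 2 s
print(probe_interval_ms({"FAILURE_DETECTION_TIME": "100"}))  # minimum: about every 20 ms
```

Under these assumptions, dropping the detection time from 10000 ms to 100 ms multiplies the per-NIC probe rate by 100, which is the extra network load the text warns about.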

Configuring a Hot Standby NIC (Command Line)

These are the steps at the command line to set up a NIC as a hot standby that takes over network traffic if the primary card fails. Listing 2 shows the commands and their standard output for the configuration.

Step 1: Check the state of the current network interface. The first ifconfig -a (command [1]) in Listing 2 shows that interface iprb0 is up using 10.1.1.1 as part of the 10.1.1.0 network, so we have verified that the system is in a normal network state.

Step 2: Configure the primary card; this is shown by command [2] in Listing 2. We chose 10.1.1.200, an address unique on the network, to use as the private address. Since this was our first command with group SERVER1, it created the IP multipathing group, named it SERVER1, and added the NIC associated with iprb0 to it. It also started the in.mpathd daemon. The addif 10.1.1.200 netmask 255.255.255.0 broadcast 10.1.1.255 portion added a logical interface, iprb0:1, to the NIC. The -failover option marks 10.1.1.200 as a non-failover address. That is, in.mpathd will not make a logical interface for it on another NIC if this NIC should fail. The deprecated option marks 10.1.1.200 as not being available as a source address for outbound packets unless explicitly asked for (bound). Finally, up enables the logical interface just created. The second ifconfig -a (command [3]) shows the successful completion of the command with the new interface iprb0:1. Notice that the logical interface iprb0:1 is DEPRECATED and NOFAILOVER.

Step 3: Configure the hot standby card; this is shown by command [4] in Listing 2. We chose 10.1.1.201 for this address. The plumb option sets up the connections between the device driver and the NIC, and 10.1.1.201 netmask 255.255.255.0 broadcast 10.1.1.255 sets up the IP address for this NIC. The address is marked as deprecated, as a member of the group SERVER1, and as not failing over if the NIC fails. The standby option marks this NIC as a hot backup for a failed NIC in the group SERVER1. Finally, the up option at the end enables the interface. The final ifconfig -a (command [5]) shows iprb1 successfully configured to be a hot standby. Later, we will describe some testing to determine how failover performs.
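As a compact summary of Steps 2 and 3, the following Python sketch assembles the two ifconfig command lines from the options named in the steps above. It only prints the commands and does not run them. Listing 2 remains the authoritative version; the exact ordering of options on each command line is our assumption, and the helper names are hypothetical.

```python
NETMASK = "255.255.255.0"
BROADCAST = "10.1.1.255"
GROUP = "SERVER1"


def primary_test_address_cmd(nic: str, test_addr: str) -> list:
    """Step 2: join the group and add a deprecated, non-failover test address."""
    return ["ifconfig", nic, "group", GROUP,
            "addif", test_addr, "netmask", NETMASK, "broadcast", BROADCAST,
            "-failover", "deprecated", "up"]


def standby_cmd(nic: str, test_addr: str) -> list:
    """Step 3: plumb the second NIC as a hot standby in the same group."""
    return ["ifconfig", nic, "plumb", test_addr,
            "netmask", NETMASK, "broadcast", BROADCAST,
            "deprecated", "group", GROUP, "-failover", "standby", "up"]


# Print the commands for review; nothing is executed here.
for cmd in (primary_test_address_cmd("iprb0", "10.1.1.200"),
            standby_cmd("iprb1", "10.1.1.201")):
    print(" ".join(cmd))
```

Keeping the commands as argument lists makes it easy to review them before running anything on a live system, where a mistyped address could drop connectivity.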

Configuring a Hot Standby NIC (Startup Scripts)

Next, we will quickly show the equivalent syntax to preserve failover across reboots. The idea is the same, but the syntax differs considerably. Initializing a NIC is a relatively complex activity controlled by the startup script /etc/init.d/network. This script uses the file /etc/hostname.NICdriver to determine whether a NIC is to be initialized. Normally, this file contains only the hostname associated with the NIC, but it becomes greatly expanded if you use multipathing failover. Listing 3 shows first the contents of /etc/hostname.iprb0, the primary NIC, and then /etc/hostname.iprb1, the hot standby. Again, we don't need to start in.mpathd manually; it is started by the use of group in the configuration files. One caution on the importance of proper syntax: on the 10.1.1.11 address, we left out the "up" the first time. This caused DNS to stop working, with an error about not being able to find the hostname of the DNS server.

IP Network Multipathing Testing

By following the above steps, we have configured a primary network card and a hot standby. For initial testing, we changed the FAILURE_DETECTION_TIME to the minimum 100 ms in /etc/default/mpathd. Listing 4 shows an extract of the system event logging in /var/adm/messages as we put traffic on the network and forced network failures and repairs by pulling network cables. Note that as the system becomes loaded, it cannot keep a failure detection time of 100 ms and reports what it actually can do. Also, note the various messages about NIC failures, successful failover to iprb1, and repair detection and failback to iprb0. Our conclusion is that failover works as advertised although very short failure detection times may not be supportable under load.
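Because these events end up in /var/adm/messages, a simple filter is enough to pull out the multipathing entries for review or notification. The sketch below is a minimal Python example that keeps only lines mentioning in.mpathd. The sample lines are placeholders rather than real log output (see Listing 4 for the actual messages), and the function name is hypothetical.

```python
from typing import Iterable, Iterator

DAEMON_TAG = "in.mpathd"  # daemon name as given in the article


def mpathd_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the log lines that mention the multipathing daemon."""
    for line in lines:
        if DAEMON_TAG in line:
            yield line.rstrip("\n")


# Placeholder lines, not real log output.
sample = [
    "<timestamp> <host> in.mpathd[<pid>]: <placeholder message>\n",
    "<timestamp> <host> sshd[<pid>]: <placeholder message>\n",
]
print(list(mpathd_lines(sample)))
```

On a live system, the same function could be fed an open file handle for /var/adm/messages instead of the sample list.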

The other part of our testing focused on the connection impact of various FAILURE_DETECTION_TIMEs. Small values of the parameter add greater network traffic in the form of more heartbeat pings and might not even be supportable because of system load. We therefore put the parameter back to its default 10000-ms value and checked the normal UNIX connection utilities (telnet, ftp, ssh) for the impact of failover. We found that no utility would time out with a value of 10 seconds. We were able to repeatedly pull cables, going back and forth between the cards, and never encountered a failure. We transferred a 100-MB file with ftp through four failovers back and forth without any problems. We conclude that the default 10-second value for FAILURE_DETECTION_TIME is adequate for many applications but that each application will need to be tested.

Configuring Multiple NICs (Startup Scripts)

Configuring the system for two active cards is very similar to the standby setup. Listing 5 shows the /etc/hostname.iprb1 setup. It is now a clone of /etc/hostname.iprb0 with different IP values. The /etc/hostname.iprb0 file has no changes. With this setup, the system can fail over and recover in either direction, and it has the advantage of greater throughput because of the second active NIC.

Conclusions

We found that Sun's IP Network Multipathing facility provides a hot standby for NIC and network hardware failures. It can be configured as a primary NIC with a hot standby or as two primary NICs, each failing over to the other as necessary. This capability should be useful to many people, especially for systems with very high-availability requirements. It is very easy to configure your system to fail over as required. The default timeout value of 10 seconds is adequate for many applications, but the required timeout value for each application will have to be determined. Very small timeout values may not be supportable by a loaded system. System events are logged in /var/adm/messages.

References

IP Network Multipathing Administration Guide, Part 806-7931-10, Sun Microsystems, Inc., Palo Alto, CA 94303-4900, April 2001.

Man page -- man in.mpathd

1. An IP link is a communication facility or medium over which nodes can communicate at the link layer. This is the subnetwork access layer of the TCP/IP network model, or layers 1 and 2 of the OSI model. Think LAN, or switch, or 10Base5 (thicknet) cable.

Brian Gollsneider is working on a PhD in Electrical Engineering from the University of Maryland. When not buried in research, he is a UNIX instructor for Learning Tree International. He can be reached at: gollsneb@glue.umd.edu.

Arthur M. Messenger is a retired UNIX systems administrator who occasionally answers questions for friends and works part time for Learning Tree International. When not teaching, he lives with his wife in Haymarket, Virginia where they spend time with their grandchildren. He can be reached at: Arthur.Messenger@att.net.