
The Secrets of openMosix

Richard Ferri

The ultimate promise of clusters is one of inexpensive, scalable compute power fueled by commodity hardware. openMosix expands that promise to include open source software. Luckily for programmers and administrators, openMosix fulfills the promise of an open, scalable, commodity cluster solution.

For those of you who have heard of Mosix, but are wondering what openMosix is, I'll provide a bit of background. The Mosix project, headed by Professor Amnon Barak, began in 1977 at the Hebrew University of Jerusalem. The goal of Mosix is to transform an unwieldy and perhaps heterogeneous pile of computers into a single efficient cluster computing resource. This transformation is based on a set of kernel changes, which provide a single system image cluster, and a scheduler, which deftly moves processes to the various nodes of the cluster based on a "least loaded" algorithm.

The Mosix project underwent a serious schism in January 2002, which spawned a new generation of Mosix -- the openMosix project. In a dispute over the "commercial future" of Mosix, Dr. Moshe Bar, the Mosix co-project manager since 1999, started a new company (Qlusters, Inc.) and forked the Mosix project into the new offering, openMosix. openMosix has quickly caught on in the clustering space, consistently at or near the top of the most popular clustering projects list at SourceForge.

Although the long legacy of Mosix spans more than two decades, with roots as far back as the PDP-11/45 running Bell Labs UNIX 6, and offshoots into VAX machines and BSD UNIX, this article addresses only the Linux version of openMosix on x86 machines.

Single System Image

To fully appreciate the beauty and value of openMosix, we must understand that big term in the opening paragraphs, the "single system image" (SSI) cluster. In the book In Search of Clusters, Gregory Pfister defines SSI as "the illusion, created by software or hardware, that a collection of computing elements is a single computing resource". Pfister introduces the idea that there are levels of SSI in a cluster, that some subsystems may recognize the cluster as SSI, and some may not. The user perspective comes into play as well. For example, when a user goes online to order a book from amazon.com, he sees amazon.com as a single computing resource, a single image, without seeing the underlying infrastructure that delegates his request to an individual server. So when referring to openMosix as an SSI cluster, we must talk about how openMosix has addressed the various shades of SSI, specifically from the perspective of:

1. The process
2. The user
3. The filesystem

In Unix, a program is an executable file, and a process is an instance of the program in execution (see Bach). We can consider a process space as a collection of all the information about the running processes on a system. In a traditional, non-SSI environment, every machine would have its own individual process space, and every space would have to be managed separately. In openMosix, there is the concept of a Unique Home Node, or UHN. When a process is spawned in an openMosix cluster, it must originate on a single node, and that node is known as the UHN. From the UHN, the process can be migrated to any other less heavily loaded node in the cluster. However, from the perspective of the UHN, an openMosix cluster forms a single, cluster-wide process space. All the processes that originate from a UHN, and are later migrated to other nodes in the cluster, can be administered directly from the UHN just as if the migrated processes were local to the UHN. This administration of remote processes from their UHN is transparent to the end user, and creates a single system illusion with regard to the process space.

To distribute the workload across a cluster, the openMosix kernel itself makes the decision to migrate a process from a node that's more heavily loaded to one that's less loaded. In a homogeneous cluster, where all nodes have the same compute power in terms of CPU speed and memory, the decision to migrate a process can be fairly straightforward. One of the features of openMosix is that the openMosix kernel tries to assess the "horsepower" of each node in a heterogeneous cluster, where nodes have different speeds and amounts of memory. Admittedly, combining disparate factors like CPU speed and amount of memory to arrive at an overall horsepower rating for a node is a bit of educated guesswork. However, this per-node rating allows the openMosix kernel to decide when an individual process should be migrated to a less heavily loaded node.

From the user perspective, a single cluster-wide process space greatly simplifies process administration. Sending a signal to a process, to kill it for example, does not require the user to know the node on which the process is actually running. Equally important is that the user does not have to manually balance the workload across the cluster, based on the talents and amount of work of each individual node.

openMosix dynamically balances the cluster workload by moving work from heavily loaded nodes to more lightly loaded nodes. From the user's perspective, this migration is completely transparent. With regard to process management, the end user administers an openMosix cluster just as he would a single workstation.

Besides process management, another aspect of SSI that is evident to the user is how the cluster adjusts to the addition of new nodes. To be completely SSI, the addition of a node would be transparent to the end user -- the new node would seamlessly add horsepower to the running cluster. One of the new features of openMosix is an auto-discovery daemon that allows new nodes to join the cluster without administrator intervention. When a new node is booted, the auto-discovery daemon (omdiscd) uses multicast packets to inform the other nodes that the new node is joining the cluster (see Buytaert). Once the new node is evaluated for its horsepower and load, it is assigned work based on the needs of the cluster. Since the addition of new nodes is trivial, we can say that openMosix is very close to the ideal in maintaining SSI when changing the node composition of the cluster.
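If you want to experiment with auto-discovery rather than the static node map described later in this article, the daemon ships with the openMosix user tools; assuming the tools are installed and the binary is on your PATH, starting it on each node can be as simple as:

omdiscd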

By now you're probably wondering, how does this cleverly migrated process get access to its filesystems? After all, if a process creates a temporary file on one node, the process would have to maintain access to that file even if the process is migrated to another node. For example, if a process running on node 1 creates a file /tmp/foo, and then the process is migrated to node 2, how does the process access the /tmp/foo file on node 1? openMosix has several alternative solutions for remote file access.

The simplest (and least efficient) approach openMosix takes to file access is for the openMosix kernel running on the remote node to intercept all I/O requests from the migrated process and send them to the UHN. As you can imagine, doing all file access over the network invariably slows down a remote process. The second approach is for the user to create a consistent cluster-wide view of the data using the Network File System (NFS). Not only is maintaining a consistent view of all the filesystems (including permissions and owners) a painful process, it again forces all I/O for a migrated process to be performed over the network. NFS also has several long-standing problems of its own, the greatest of which is a lack of cache consistency -- if two processes are writing to the same file at the same time, they basically slug it out to see whose writes actually get written out to the file. In clustering, this is known as bad karma.

openMosix provides a much more elegant solution than either redirecting I/O through the openMosix kernel or using NFS. The current openMosix solution for file access is oMFS, or the openMosix File System. oMFS is a modern cluster-wide filesystem that is administered similarly to NFS, but with SSI features that include cache, timestamp, and link consistency. These features ensure that regardless of how many remote processes are writing to a single file at the same time, data integrity and file information are preserved. openMosix has another feature called DFSA, or Direct File System Access, which builds on a cluster filesystem such as oMFS. With the DFSA feature installed, openMosix determines whether it is more efficient to let a migrated process perform its I/O remotely or to migrate the process back to the node that actually contains the data -- in other words, move the process to the data, instead of the data to the process (see Bar). With the advent of modern cluster filesystems like oMFS, and the performance enhancement of DFSA, we can say that openMosix is SSI with regard to filesystems, and provides superior performance with DFSA.
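As a taste of oMFS administration, mounting the cluster filesystem comes down to a single /etc/fstab entry once the oMFS (and, optionally, DFSA) kernel options are enabled. The entry below is only a sketch -- the device field (mfs_mnt) is a placeholder, the /mfs mount point is a convention rather than a requirement, and the exact option syntax should be verified against the openMosix HOWTO:

mfs_mnt  /mfs  mfs  dfsa=1  0 0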

Now that I've examined openMosix as an SSI solution from the perspective of the process, the user, and the filesystem, I'll look at how to implement an openMosix cluster.

Installing an openMosix Cluster

In this section, I'll provide an overview of how to build your own openMosix cluster. This article is not exhaustive in its treatment of installing openMosix; refer to the excellent HOWTO maintained by Kris Buytaert listed at the end of this article for further details and the latest information.

There are actually two pieces to openMosix -- the openMosix kernel and the userland tools. The openMosix kernel has been modified to provide the single system image I talked about previously and to support the openMosix File System (oMFS). The userland tools provide all the executables you need to form the cluster and manage migrated processes. You have two choices when building your openMosix cluster -- either install the openMosix kernel and user tools from RPM, or download the source and build your own openMosix kernel and user tools. The project documentation says openMosix is supported on Red Hat, SuSE, and Debian distributions. I tried openMosix under Mandrake 9.0, and it installed and ran without a problem.

It's certainly easy to install openMosix directly from RPM. The RPMs are on the sourceforge openMosix home page (http://www.sf.net/projects/openmosix, follow the Files link). From there you can download the appropriate kernel RPM (e.g., openmosix.kernel-2.4.20-openmosix2.i386.rpm) and the user tools (e.g., openmosix-tools-0.2.4-1.i386.rpm). Installing the kernel RPM will not update /etc/lilo.conf (the lilo loader configuration file), so you'll have to add a stanza for the new openMosix kernel, and run lilo. Just run the rpm command against the openmosix-tools RPM to install the commands in the correct directories, and you're ready to start up your openMosix cluster.
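As a sketch, the RPM route boils down to two rpm commands plus the boot-loader update; the package file names below are the examples mentioned above and will vary with the versions you download:

rpm -ivh openmosix.kernel-2.4.20-openmosix2.i386.rpm
rpm -ivh openmosix-tools-0.2.4-1.i386.rpm

After adding a stanza for the new kernel to /etc/lilo.conf, run lilo and reboot into the openMosix kernel.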

One of the drawbacks, however, of installing directly from RPMs is that the user must accept the kernel options that the openMosix RPM was originally built with -- for example, process migration is on by default, but oMFS is off. If you want a different set of kernel options, you'll have to build the kernel and the user tools yourself. If you haven't built a Linux kernel before, it's not all that scary. You basically download the source for both the kernel and user tools, and follow a recipe something like this:

1. Install your favorite Linux distribution. (I chose Mandrake 9.0, with network clients and development packages included.)

2. Download the kernel source from kernel.org and copy it to /usr/src. (I chose the 2.4.20 version, linux-2.4.20.tar.bz2.)

3. Download the openMosix kernel patch from the Files link on the sourceforge homepage (http://www.sf.net/projects/openmosix) and copy it to /usr/src.

4. cd to /usr/src, and untar the kernel source:

tar -jxvf linux-2.4.20.tar.bz2

5. cd to linux-2.4.20 and apply the patches to the kernel source:

zcat .../openMosix-2.4.20-2.gz | patch -p1

6. make xconfig, modify any openMosix settings, and save and exit to create a .config file (the list of all the kernel flags).

7. Build the kernel:

make dep bzImage modules modules_install

8. Take the bzImage you just created and copy it to /boot:

cp `find . -name bzImage` /boot/openmosix-2.4.20

9. Update /etc/lilo.conf with a new stanza and run lilo. This installs the openMosix kernel.
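A minimal stanza for the new kernel might look like the following; the label and the root= device are assumptions that must be adjusted to match your own system:

image=/boot/openmosix-2.4.20
    label=openmosix
    root=/dev/hda1
    read-only

After saving lilo.conf, run lilo and verify that the new entry (openmosix here) appears in its output.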

Once the kernel is built and installed and lilo'ed, it's time to build the user tools.

1. Download the user tools source archive from the Files link on the sourceforge homepage (e.g., openMosixUserland-0.2.4.tgz).

2. Untar the archive:

tar -zxvf openMosixUserland-0.2.4.tgz

3. cd to openMosixUserland-0.2.4 and modify the configuration file as necessary. The only change I had to make was to the OPENMOSIX variable:

OPENMOSIX = /usr/src/linux-2.4.20

4. make all

The make all copies all the openMosix binaries to their respective directories, ready to be executed.

Remember, these recipes are provided as examples. The definitive methods for building and installing the kernel and the user tools are described in the READMEs provided with the kernel source archive and the userland tools archive, respectively.

Starting openMosix

To perform a meaningful test of openMosix, you really need a second workstation, to which you've added the openMosix kernel you just created and on which you've once again updated lilo.conf and run lilo. Note that if you copy only the kernel to the second node, and do not copy over or rebuild the user tools, you will not have the necessary openMosix commands to control processes from the second node. Since this is just a simple test, I copied over only the openMosix kernel and decided to control all my processes from the original node where I built the user tools.

Now that your nodes are rebooted and running the openMosix kernel, it's time to tell openMosix which nodes are in the cluster. The cluster is formed using the setpe command, supplied with the openMosix user tools. You can define the cluster from the command line or with a simple map file that tells openMosix the range of IP addresses included in the cluster, as in:

setpe -w -f /etc/hpc.map
where hpc.map contains:

1 mini.pok.ibm.com 2
which indicates that the first node in the cluster is node number 1, that its address is mini.pok.ibm.com, and that the cluster extends for 2 nodes. Since the IP address of mini is 192.168.0.1, the address of the second node would be 192.168.0.2. Once you've run the setpe -w command on both nodes to define the cluster, you can run setpe -r to verify that both nodes are indeed recognized by openMosix.
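The same format scales to larger clusters: each line gives a starting node number, a starting hostname or IP address, and the number of consecutive addresses in that range. A hypothetical map describing six nodes split across two address ranges might look like this:

1 192.168.0.1 4
5 192.168.1.10 2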

Fun with Processes

Once you've built your cluster, configured openMosix, and informed openMosix that the cluster has two nodes, it's time to create a long running process that can be used as an example of process migration. I stole the example from the openMosix HOWTO, creating a file called testscript:

awk 'BEGIN {for(i=0;i<10000;i++)for(j=0;j<10000;j++);}'
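To launch a dozen copies quickly, mark the script executable and start it in a loop (a sketch that assumes testscript is in the current directory):

chmod +x testscript
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do ./testscript & done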
After starting testscript a dozen times or so in the background (e.g., testscript &), the fun begins. Depending on what is running on the two nodes in your cluster, openMosix will immediately start migrating the processes associated with the running testscripts to the second node of the cluster. Since the migration is transparent, it will not be obvious that the processes have moved. If you check the load on both nodes of the cluster using the mosctl command (you'll have to copy or NFS mount the userland binaries to the second node), you should see that the load on both cluster nodes is similar:

mosctl getload
Both load numbers were around 580 on my cluster. An even better indicator would be to run the simple monitor on the first node, using the mosmon command. It will show graphically that the work is divided between the two nodes. At this point, I decided to grab all the work and migrate all the processes back to the original node, again using the mosctl command:

mosctl bring
Immediately all the processes that were migrated to the second node were re-migrated to the first node. Using mosctl getload showed that the second node's load was nearly zero, and the load on the first node nearly doubled, indicating that the processes had indeed migrated home. Executing mosctl nolstay undoes the effect of mosctl bring. Once mosctl nolstay is executed on the original node, openMosix will distribute some portion of the work to other nodes in the cluster. Again, this can be verified with the mosctl getload command, and by watching the mosmon monitor.

At this point, we can be pretty certain that processes are being migrated around the cluster, based on the load indicators, however, we want to see some increased performance in the application. After all, the real advantage of openMosix is a transparent increase in processing power. To show a measurable performance increase, I wrote a script called "doit" that ran our testscript described above. doit looked something like this:

#!/bin/sh
date
testscript&
testscript&
wait
date
This script checks the date/time, starts two instances of the little counting testscript in the background, and then waits for completion of the child processes before displaying the time of completion. The "wait" is very important -- otherwise the script just starts the two instances of testscript in the background and exits immediately, not waiting for completion. On the original node, I ran mosctl bring to limit all processes to the single node, and ran doit. It took 2:07. Then I ran mosctl nolstay to bring the second node back into the cluster, and re-ran doit. This time, doit took 1:07. Allowing for overhead, the second node cut our processing time nearly in half, demonstrating that the second node did in fact speed up the overall throughput of the cluster. Your mileage may vary.

Conclusion

In this article, I've examined how openMosix provides the single system illusion to the end user with regard to process management and node addition. I've also talked about how a migrated process has data consistency as it's moved from node to node, through the cluster filesystem oMFS. Yet, this information barely scratches the surface regarding openMosix. This project has a wealth of associated monitors and management tools, including a full-screen process management GUI, openMosixview (http://www.openmosixview.com), and various daemons to collect and analyze process history data. As the popularity of openMosix grows in the commercial space, more sophisticated tools will emerge over time.

In deciding whether to install an openMosix cluster at your installation, it's important to realize that the smallest manageable unit in openMosix is the process. An application that runs as a single process will not benefit from running under openMosix. If the same application is broken into several processes that can run in parallel, these processes can be distributed throughout an openMosix cluster to enhance performance (see Robbins). Ultimately, if you're looking for a transparent, dynamic method of balancing workload across a Linux cluster, and your workload can run as parallel processes, openMosix may be the solution for you.

Acknowledgements

I thank Moshe Bar for his review and recommendations regarding this article.

Resources

Bach, Maurice J. The Design of the UNIX Operating System. Prentice Hall, 1986, ISBN 0-13-201799-7.

Bar, Moshe. openMosix Internals: How openMosix works. See: http://www.openmosix.sourceforge.net

Bookman, Charles. Linux Clustering. New Riders, 2003, ISBN 1-57870-274-7.

Buytaert, Kris. The openMosix HOWTO: http://howto.ipng.be/openMosix-HOWTO

Historical information on Mosix: http://www.mosix.org

Pfister, Gregory F. In Search of Clusters. Prentice Hall, 1998, ISBN 0-13-899709-9.

Robbins, Daniel. "Advantages of openMosix on IBM xSeries". See: http://www.ibm.com/servers/esdd/articles/openmosix.html

Richard Ferri is a Senior Programmer in IBM's Linux Technology Center where he contributes to and writes about open source clustering projects. He attended Georgetown University many years ago with a concentration in English. He lives in upstate New York with his wife, Pat, three teenaged sons, and an ever-changing cast of critters. Rich can be reached at: rcferri@us.ibm.com