Cover V14, i01

Article

jan2005.tar

Dirvish for Disk-to-Disk Backups -- An Open Source Success Story

Keith Lofstrom

What will you do if a company that produces one of your critical software applications goes out of business? If it's proprietary software, you may be in big trouble. Open source, on the other hand, leaves you in control (sometimes, too much control). This article tells how I ended up managing an open source backup tool, dirvish.

My integrated circuit design consultancy, KLIC, uses Linux and other open source software for nearly all important tasks. One critical task is backup; I cannot afford to lose days of work to a disk crash or a typing error. In 2003, I moved from tape-based backups of individual machines to network-based disk-to-disk backups. After some experiments using rdump to swappable hard-disk media, J.W. Schultz told me about his disk-to-disk backup program called dirvish. Dirvish is a Perl wrapper around another open source tool, rsync.

Rsync and Dirvish

Rsync copies and maintains images of file systems between computers. Rsync can be used for mirrors, copies, and backups. With rsync, files are not just moved and copied, but are segmented, checksummed, and compared between source and destination machines. Only data segments with changed checksums or modification dates are moved. Rsync's --link-dest option permits Unix-style hard links to files in other directories on the destination machine. Hard-linking allows successive backup images to share identical data files and occupy the same disk space.

Hard disk backup storage, and the linkage between backup images, have other benefits for my consulting business. Some consulting contracts require that all copies of client data be removed at the end of the project. This is impossible with tape-based backup; I cannot remove one client's data and email without wiping all my backup tapes. With my backup data stored as a regular file system on disk, all I have to do is write over the client files; everything hard-linked to those files is written over as well.

Rsync typically uses SSH for secure transport between machines and sets up multiple data pipelines to reduce latency. Rsync has been ported to most flavors of Unix, and rsync clients are also available for Windows and Macintosh. Rsync is the execution engine for many new backup systems, including dirvish.

Dirvish adds configuration-driven automation to rsync, making it suitable for cron-driven nightly backups, driven from by a central backup server. Dirvish permits networks of similar machines to share portions of the backup images that are identical. Dirvish also has a sophisticated expiration process for removing selected older images and recovering backup drive space.

Control of the backup process by the central backup server is critical. Client-driven backup schemes are possible, but this compromises security and performance. Rsync is network-intensive, and only the server can properly sequence client backups to control network loading. The backup server must be absolutely safe from compromise; since it contains images of all clients, it represents a single-point security risk for the entire network. Fortunately, a dirvish backup server can be configured to perform only the outbound SSH connections necessary to rsync to the clients, while ignoring all inbound service requests. This makes the backup machine ("server" may be a misnomer) difficult to attack.

I use dirvish to perform nightly backups of half a dozen machines (about 100 GB total data), on-site and remote, in about one hour. My system data changes infrequently. Given the efficiencies brought by hard-linking between images, I can store 200 or more nightly images on a 250-GB backup drive on my main server. With large IDE hard drives selling for less than 60 cents per gigabyte, I have greatly reduced the cost and bother of making nightly backups.

A worst-case failure might destroy both the server and the backup drive in it. I have multiple IDE backup drives in hot-swappable, removable trays (e.g., see http://www.vipower.com). I then rotate the drives to a fire-resistant safe each day. By rotating the backup drives to the safe (which is easy with hot-swap), even worst-case failures or full security breaches are survivable.

Bare-Metal Recovery

Backup is only half the problem, of course; rebuilding a crashed hard drive is every sys admin's nightmare. I build backup drives with a 5-GB bootable Linux partition and a 2-GB swap partition, slightly reducing the size of the main backup partition. For a bare-metal server recovery, I boot from the backup drive and use simple scripts to rebuild the main server hard drive. A client drive can be rebuilt in a third server drive bay, from a second drive bay on the client, or on a different machine entirely. Full restores now take five minutes of configuration work, followed by a couple of hours of disk-to-disk copying.

Windows and Macintosh OS X drives may be more difficult to restore from bare metal. These file systems are not fully supported by Linux, so the restores must be performed natively, on a live Windows or Macintosh machine, to a second empty disk. This can be done over the network with rsync, but the process will involve careful sequencing of scripts at both ends.

The situation is tolerable for Macintosh OS X. There is rsync support for both UFS/FFS and the older HFS file system. It should not be difficult to run dirvish and its associated scripts natively on one machine under OS X, or between OS X machines (although I have not tried this). If the backup drive is connected with USB2, then hot-swaps can be performed with USB2 mount and unmount.

On Windows, some files may not be accessible to rsync for read and write. This is an intentional "feature" of Windows; Microsoft does not want it to be easy to copy their proprietary code. Microsoft or other third-party licenses may forbid backing up their software to another hard disk.

If you are backing up purchased, proprietary programs or information, you may be legally restricted in what you may back up and how you may do it. Digital Rights Management hardware and software may add further complications. Consult your licenses and your lawyers.

These complications are part of the cost of running proprietary operating systems that use file systems without open specifications. We can still restore most of user-created Windows files and directories over the network, and that should be good enough for helping most Windows users maintain their data integrity day-to-day. If Windows users want robust, inexpensive, easy-to-maintain, legally unencumbered, and technologically advanced solutions to their problems, they may want to consider open source offerings.

Dirvish is a great tool -- it's a little rough around the edges, and needs better documentation, but it's turned a big backup chore into a small one. Other users have also found J.W. Schultz's open source software project very useful. In gratitude, I prepared slide talks and writeups of my experiences with dirvish and presented them at user groups and conventions. J.W. was very helpful in critiquing and improving these presentations.

And Then It All Changed...

The technical editor of Sys Admin magazine, Hal Pomeranz, attended one of my presentations and asked me to write an article about dirvish for the October 2004 issue on backups. I wrote excitedly to J.W. to invite him to co-author the article, which would be a real boost to his consulting company, Pegasystems Technologies. Strangely, there was no response. With the due date for the article looming, I had to tell Hal that no article was possible without the original author of the code.

All I knew about J.W. was that he was a middle-aged consultant in the San Francisco bay area, a good Perl coder, and a fan of the former U.S. space program. I didn't even know his first name, or his phone number. Even his site registration was anonymous. In the open source community, this really doesn't matter much, until you want to find someone.

The original download site for dirvish, http://www.pegasys.ws, went off the air sometime in May. In June, as a precaution, I registered the domains "dirvish.org" and "dirvish.com". If J.W. reappeared, the domains would go to him. After another month without news, I took an archived version of the site (from September 2003) and activated the domains on my Web server. I patched in the latest version of the dirvish Perl code from the Debian archives. Later versions of pegasys are now beginning to appear at http://www.archive.org.

In July 2004, I was told of a March Sacramento newspaper article, describing the death of software consultant Jonathan W. Schultz of Antioch, California, by drowning in Lake Mead. I have since made contact with Jon's parents. Jon was 42 when he died. He was reclusive, yet he had Internet friends all over the world. He contributed heavily to the development of rsync. He was a devout Christian and will be missed by many.

Jonathan was a natural coder who could think in Perl. I once asked him if I could annotate the dirvish Perl code with comments. He replied, "If the code is so littered with comments as to call for a grep -v to make it readable as code, the comments should be removed. You should edit the same code that you read." As Jon's father later explained, "He once told me that when he looked at the printout, it was like looking at a picture of a person, and if something was in the wrong place or malformed it would be very obvious to him. This was a God-given gift that neither of us understood...".

Most programmers do not have that gift, and Perl is a very rich language, making it more difficult to read than other languages. J.W.'s death left dirvish users and developers in a quandary; where do we go with dirvish? How can we improve mission-critical code that we control but do not fully understand?

Continuing the Work

The custom in the open source community is that if the originator of a software package disappears or loses interest, another may take up the code and continue. While I am a mediocre coder, and a Perl newbie, I have managed technical projects and am better than most at technical communication. Since management and communication are dirvish's greatest needs, I took temporary control of the dirvish project. Perhaps a better leader will emerge in time.

And I am diving deep into Perl. Randal Schwartz' books Learning Perl and Perl Objects, References, and Modules have been invaluable to my education, and his regular column in Sys Admin has been useful as well. The Portland-area Perl community is very supportive, and when I don't understand something, there are plenty of people to help.

I set up a wiki at http://www.dirvish.org/wiki, where users and contributors can add their information, patches, and requests for enhancements. In just a few months, there have already been some great contributions. We are accumulating patches for a small experimental release, while we develop techniques for testing the code for future general releases. By the time you read this, we should have an active mailing list.

There are still important management issues to address. The code needs more comments; ordinary programmers must be able to read it. We may veer from J.W.'s "no comment" approach to make the working environment safer. I am writing a design description of the software for the wiki. The configuration file format is confusing, with some things becoming Perl scalars and other things becoming lists in an overly restrictive way; we need to make the configuration task easier.

Finally, additional tools need to be created or borrowed from other packages. A more automated restore system is needed, one that easily adapts to changing partition needs and permits the selection and mixing of different backup images. Single file restore should be driven by users, with minimal intervention by systems administrators.

I may make all of these things happen, or none of them. Since dirvish is an open source software package, others are free to contribute improvements, bug fixes, or whole new interfaces. If others discover problems, they are free to discuss them in public and seek help from members of the dirvish community. Testing results are public; if something doesn't work, users can learn about the problem quickly. If I mismanage the project, others are free to fork the code, rename it, and abandon me to the contemplation of my errors.

Free software is about freedom and allows you to make your own right (or wrong) decisions or delegate them to the organizations that you choose. It gives you control. Free software is also about the benefits of cooperation and contribution. Even if you have a lot of clout with a proprietary software vendor, you are unlikely to get the enhancements you need, when you want them, and how you want them. The most direct way to get what you need in an open source package is to design and program the enhancements yourself. In the worst case, the enhancement you offer may get rejected by the maintainer and the other users; you can still apply your patches to your own version. A more likely outcome is that some of the other users will agree with you, and you can test, maintain, and improve your patch with others.

However, the most likely outcome is that your improvement will be gratefully accepted by maintainer and users alike and result in a cascade of improvements on your improvement. The maintainer will probably be grateful for the contribution, and other users are likely to have the same needs you do. You will not have to fight your way through layers of managers and marketers, and you will not have to fit some vendor's strategic plan. The needs of the contributors always outweigh the hypothetical needs of "future customers". Best of all, others will test your code in widely varying situations; by sharing your code, many mistakes will be found and corrected by others, before they affect you and your company.

Conclusions

Your contribution does not need to be large; it can be as simple as a single line of code, another test case, or even a spelling correction. The main difference between a successful or unsuccessful contribution is your own understanding of the intent of the software and how the community already uses it, and your willingness to cooperate with others. Please help us develop dirvish and rsync. Join us on the wiki at http://www.dirvish.org/wiki and discuss your needs, share your experiences, and offer your enhancements.

In the second part of this series, I'll discuss how to install and configure dirvish. I will set up an example network and install hardware and software to do dirvish disk-based backups. In the third part of the series, I will use these backups to do bare-metal restores of client and server hard drives.

Resources

KLIC -- http://www.kl-ic.com

Dirvish -- http://www.dirvish.org

Dirvish Wiki -- http://www.dirvish.org/wiki

Keith Lofstrom (http://www.keithl.com) owns an integrated circuit design consultancy in Beaverton, Oregon. His specialty is mixed-signal and statistical design for deep submicron processes, as well as design for testability using the IEEE 1149.x standards. Keith has been using some flavor of Unix since 1980, and although he admits to a brief flirtation with DOS and Windows, he has seen the error of his ways. He is currently managing the dirvish backup program.