jun2004.tar

Keeping Data in Sync::rsync

Chris Hare

Data management is an ongoing issue that plagues many companies. Organizations are constantly looking for ways to move data securely between systems in an automated fashion, keep file systems or data files synchronized, or simply ensure that a group of systems has common data.

Keeping files synchronized in the enterprise generally means copying the file from one system to another. The assumption most organizations make is the availability of high-speed, low-latency networks where copying a large file is performed expediently. Although there are commercial tools to perform this type of data movement, many organizations cannot afford these tools or do not need the advanced functionality. This article presents the open source tool rsync, including configuration requirements, operational characteristics, and security implications.

This is the first of a two-part series examining rsync. In this article, I will describe what rsync is, how it works, and the interface between the client and the server. In the second part, I will cover configuration of the rsync server, access control, and security implications.

What Is rsync?

Rsync is essentially a replacement for the rcp (remote copy) command currently available in Unix. However, unlike rcp, rsync has many more capabilities, including the secure delivery of files using Secure Shell (SSH) as the underlying transport.

Rsync uses the rsync protocol for some of the transport operations, resulting in a more efficient and faster data transfer operation when compared with other methods. In particular, the rsync remote-update protocol provides specific benefits when the destination file already exists, by restricting transmission of the file to only the changes needed to keep the file in sync. If transmission of the changes is equivalent to transferring the entire file, rsync copies the entire file from one system to the other.

Rsync offers other distinct features, including:

Support for copying links, devices, owners, groups, and permissions
Exclude and exclude-from options similar to GNU tar
A CVS exclude mode for ignoring the same files that CVS would ignore
Use of any transparent remote shell, including rsh or ssh
No requirement for root privileges
Pipelining of file transfers to minimize latency costs
Support for anonymous or authenticated rsync servers

Rsync uses the rsync algorithm, which provides a very fast method of determining the differences between files and sends only those differences to keep the files synchronized. (See the technical paper at the rsync Web site, http://samba.org/rsync, for more information on the algorithm.) A principal advantage of the rsync protocol is the ability to determine and transmit the differences without the need for both files to be on the same system.

The rsync algorithm is highly efficient for files up to 1 GB in size. Consequently, the benefits of using rsync over rcp and rdist are most notable in the area of bandwidth consumption. The principal differences between rsync and other file synchronization tools are:

Copying only the differences -- Only the actual changed parts of the file are copied, rather than the whole file. This improves update speed, especially over lower speed links. Applications such as rcp, rdist, and FTP would transfer the entire file, even if only one byte had changed.
Compression -- The data to be transferred is compressed, further improving file transfer speed and bandwidth utilization.
Security -- Rsync can use several transport methods, including SSH for improved security. The rsync data stream is passed through the SSH tunnel, protecting the data. Use of a facility like ssh is required because neither the rsync nor the rsh protocol provides any real data security or protection capabilities.
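
As a rough sketch of how the last two points translate to the command line, the helper below prints an rsync invocation that asks for compression (--compress) and the ssh transport (-e ssh). The function name sync_command is my own, and the file and host names are placeholders; it only prints the command so it can be reviewed before being run.

```shell
# sync_command SRC DEST -> prints an rsync command line for review.
# Assumption: ssh is the desired transport and compression is wanted.
sync_command() {
    printf 'rsync --verbose --compress -e ssh %s %s\n' "$1" "$2"
}

# Placeholder file and host names; nothing is transferred here.
sync_command video2.avi chare@zrc2c01a:/home/chare/video2.avi
```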

The Pros and Cons

There are advantages and disadvantages to any given application. Table 1 lists some of these as they relate to using rsync. One of the noted disadvantages of rsync is its difficulty in processing files more than 4 GB in size. The rsync development community is working to address this and other elements of the rsync protocol and implementation.

How rsync Works

Like rdist, rsync uses rsh for communications between two hosts. This is the default configuration, although two other communication methods are available: ssh and the rsync server. Use of ssh allows for key-based authentication between the systems, eliminating the need for .rhosts and hosts.equiv files and allowing better password management. Using ssh provides a higher degree of security through both improved authentication capabilities and the use of an encrypted transport. The configuration of ssh is beyond the scope of this article.

Unlike rdist, however, rsync does not require root privileges, nor must it run with setuid or other extended privileges. Rsync does require a functioning communication path using either rsh or ssh between the systems. Again, security enhancements are gained by using ssh as the transport over rsh. The rsync protocol operates both as a standalone service and as the network daemon using the ssh transport.

Rsync can work by performing file exchange and updates between two systems, or by running in a server mode where it can listen on a network socket and provide file distribution services. When operating in distribution mode or using the rsync server, authentication and access control capabilities are available. While the emphasis of this discussion is moving data between two systems, it is also possible to use rsync to send data to multiple systems -- as many as needed.

When keeping files synchronized, the easiest method is to simply copy the file from one system to another. However, if the network link is a low-bandwidth, high-latency circuit like dial-up IP, or the file is very large, copying the entire file can take a long time. The rsync technical white paper available on the rsync Web site illustrates the transfer times for the various protocols.

Rsync achieves high throughput and effective file transfers through the use of an aggressive rolling checksum capable of finding the differences in the file. Even with protocol overhead and checksums, it is less expensive in terms of bandwidth and time to rsync two files than to transfer them in full.
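
To give a feel for what "rolling" means, here is a minimal sketch of a weak checksum of the kind described in the rsync technical report: one sum of the bytes in a window and one position-weighted sum, both modulo 65536. This is an illustration only, not rsync's implementation; the function names weak_sum and roll are my own, and bytes are passed as plain integers to keep the arithmetic visible. The point is that moving the window forward by one byte needs only the old sums, the byte leaving, and the byte entering.

```shell
M=65536

# weak_sum B1 B2 ... Bn -> prints "a b" for the whole window.
# a is the plain sum; b weights the first byte by n and the last by 1.
weak_sum() {
    local a=0 b=0 n=$# x
    for x in "$@"; do
        a=$(( (a + x) % M ))
        b=$(( (b + n * x) % M ))
        n=$(( n - 1 ))
    done
    echo "$a $b"
}

# roll A B LEN OLD NEW -> prints "a b" for the window shifted by one byte.
# OLD is the byte leaving the window, NEW is the byte entering it.
roll() {
    local a=$1 b=$2 len=$3 old=$4 new=$5
    a=$(( ((a - old + new) % M + M) % M ))
    b=$(( ((b - len * old + a) % M + M) % M ))
    echo "$a $b"
}

# Window 10 20 30 40 rolled forward over the data 10 20 30 40 50.
weak_sum 10 20 30 40      # prints: 100 200
roll 100 200 4 10 50      # prints: 140 300
weak_sum 20 30 40 50      # prints: 140 300 (same as the rolled value)
```

Because the rolled value matches a fresh computation, a checksum can be evaluated at every byte offset of a file at very little cost.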

Why Use rsync?

Many commercial products provide file synchronization, including CrossWorlds, DDS, and IBM MQSeries. However, commercial products incur licensing and maintenance costs that some organizations may want to avoid. The rsync solution provides smaller companies and even home-based businesses and individual users with the ability to synchronize data or back up files.

For example, Web or FTP administrators who need to maintain mirrors of their Web or FTP sites can do so using rsync without worrying about excessive bandwidth. This provides for redundant or load-balanced sites, as well as capacity and contingency planning. Additionally, rsync can be used to allow public file distribution using a non-FTP transport.

Since there are rsync clients and servers for both Unix and Microsoft Windows, a PC user can use rsync to keep data synchronized on an alternate system for data recovery purposes.

Using rsync

Using rsync is as simple as executing a one-line command to initiate the transport. However, before rsync will work, you must have a functioning rsh or ssh connection to the remote server. For example, to test the existence of an rsh connection to the target system, execute the following command:

[chare@gw chare]$ rsh -l chare localhost date
Permission denied.

If the connection to the remote system is successful, then rsync will work. If not, then the rsh connectivity must be corrected. (An error such as "Permission denied" generally means the target system doesn't have a .rhosts file for the user. This is one of the major security risks associated with the Berkeley r* commands.) Note that it is not required for rsync to use a .rhosts file unless the intent is to use the rsh protocol, which has inherent security issues.

Once a command can be executed on the target system using rsh or ssh, rsync is ready for use:

[chare@gw chare]$ rsh localhost -l chare date
Thu Feb 27 15:11:20 CST 2003

[chare@gw chare]$

Alternatively, ssh can be used as the transport mechanism to provide key-based authentication and an encrypted transport. Use of the encrypted transport is especially important if data must be transported across the Internet. Even if the data transport is across a private network, using ssh is the preferred method. To test your ssh connectivity to the remote server, execute the command:

[chare@gw chare]$ ssh localhost date
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is cc:f8:cc:90:8a:04:2f:24:88:4f:6e:e1:e3:a1:62:e6.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
chare@localhost's password:
Thu Feb 27 15:15:35 CST 2003
[chare@gw chare]$

If the connection to the remote system is successful, rsync is ready for use. If the ssh connection requires a password (as shown in the example) and rsync is going to be used in an automated process for the file transfers, key-based authentication using a key with no passphrase is required. Consult the ssh documentation for the correct procedure.

With the base connectivity to the remote system functioning, rsync is executed using a single command:

[chare@zrchy0je chare]$ rsync --verbose -e ssh video2.avi \
  chare@zrc2c01a:/home/chare/video2.avi
chare@zrc2c01a's password:
video2.avi
wrote 778656 bytes  read 36 bytes  74161.14 bytes/sec
total size is 778483  speedup is 1.00
[chare@zrchy0je chare]$

In the example, rsync is used to transfer the file video2.avi from the local system to the system named zrc2c01a. The --verbose option tells rsync to provide more information about the connection than normal. The -e option, which is followed by the ssh command, establishes ssh as the transport rather than rsh. (This option can be specified other ways as discussed later.) The next argument is the source file (video2.avi) followed by the destination. The destination in this case consists of:

A user name (chare)
A target host (zrc2c01a)
A colon (:)
The target directory and/or filename (/home/chare/video2.avi)
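
The parts listed above can be pulled apart with ordinary shell parameter expansion. This is only a sketch for a destination of the form user@host:path; the function name split_remote_spec is hypothetical, and falling back to the local login name when no user is given is my assumption about what the remote shell will do.

```shell
# split_remote_spec SPEC -> prints the user, host, and path on separate lines.
split_remote_spec() {
    local spec=$1 userhost path user host
    userhost=${spec%%:*}      # text before the first colon
    path=${spec#*:}           # text after the first colon
    case "$userhost" in
        *@*) user=${userhost%%@*}; host=${userhost#*@} ;;
        *)   user=${USER:-}; host=$userhost ;;   # assumption: local login name
    esac
    printf 'user=%s\nhost=%s\npath=%s\n' "$user" "$host" "$path"
}

split_remote_spec chare@zrc2c01a:/home/chare/video2.avi
```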

The colon is particularly important in the command execution. If there is only one colon, the ssh or rsh transport is used. If two colons are used (::), a connection to the rsync server on the remote system is attempted.
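
The colon rule can be summarized in a few lines of shell. This is a simplified sketch rather than rsync's own argument parser (it does not handle local paths that happen to contain a colon), and the function name classify_spec is my own.

```shell
# classify_spec SPEC -> prints local, remote-shell, or daemon.
classify_spec() {
    case "$1" in
        rsync://*) echo daemon ;;        # URL form of an rsync server
        *::*)      echo daemon ;;        # double colon: rsync server
        *:*)       echo remote-shell ;;  # single colon: rsh or ssh transport
        *)         echo local ;;         # no colon: local copy
    esac
}

classify_spec chare@zrc2c01a:/home/chare/video2.avi   # prints: remote-shell
classify_spec somehost.mydomain.com::                 # prints: daemon
classify_spec /tmp/work                               # prints: local
```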

Here is a simple example of using rsync to back up a directory from one Unix server to another using ssh:

[chare@gw chare]$ rsync --verbose -e ssh -r work localhost:/tmp/
chare@localhost's password:
building file list ... done
work/393.zip
work/394.zip
work/395.zip
work/398.zip
wrote 27703908 bytes  read 84 bytes  2052147.56 bytes/sec
total size is 27700270  speedup is 1.00
[chare@gw chare]$

The -r option instructs rsync to recursively copy the files from the work directory to the remote host.

A major benefit of rsync is its ability to copy only the data it needs to copy. For example, if the file 393.zip changes at some point, when the same command is executed again, this is the result:

[chare@gw chare]$ rsync --verbose -e ssh -r work localhost:/tmp/
chare@localhost's password:
building file list ... done
work/393.zip
work/394.zip
work/395.zip
work/398.zip
wrote 140424 bytes  read 145836 bytes  19742.07 bytes/sec
total size is 26760919  speedup is 93.48
[chare@gw chare]$

In this example, information from the file 393.zip had been removed, and the rsync protocol transmitted only the parts of the file that were necessary. This is evident in the byte counts: roughly 286 KB was written and read across the connection for a file set of about 26.7 MB, which is reflected in the reported speedup of 93.48. Note that, unfortunately, the rsync client doesn't make the number of bytes transferred for each file very clear unless the --progress option is used.
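
The speedup figure can be checked by hand: it is the total size of the files divided by the number of bytes written plus the number of bytes read. The small helper below (the name speedup is my own) does the arithmetic with awk, using the figures from the two runs shown above.

```shell
# speedup TOTAL_SIZE BYTES_WRITTEN BYTES_READ -> prints the ratio to two decimals.
speedup() {
    awk -v t="$1" -v w="$2" -v r="$3" 'BEGIN { printf "%.2f\n", t / (w + r) }'
}

speedup 27700270 27703908 84      # first run, prints: 1.00
speedup 26760919 140424 145836    # second run, prints: 93.48
```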

There are several different ways to use rsync. These are:

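Copying the files on the local system. This occurs when neither the source nor the destination includes a colon in the name. For example:
```
rsync -r /tmp/work/ /home/chare/work
```
behaves like an improved copy command, copying only the files that have changed from /tmp/work to the user's directory. The -r option is needed because /tmp/work is a directory, and the trailing slash on the source copies the contents of the directory rather than the directory itself.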
Copying files from the local to a remote system using the rsh or ssh transports. This is invoked when the destination includes a single colon in the pathname. For example:
```
rsync *.c target:source/
```
copies all files matching the pattern *.c from the current directory to a directory called "source" on the target system.
Copying files from a remote to the local system using the rsh or ssh transports. This is invoked when the source includes a single colon in the pathname. For example:
```
rsync -r target:data/ /data/tmp
```
recursively copies the files in the directory data from the remote system target to a directory called /data/tmp on the local system. The -r option is required for a directory to be processed.
Copying files from a remote rsync server to the local system. This is invoked when there is a double colon (::) in the source filename, or when the source is a URL of the form rsync://filename.
Copying files from the local system to an rsync server. This is invoked when there is a double colon (::) in the destination filename, or when the destination is a URL of the form rsync://filename.
Copying from an rsync server on the remote system to the local system, using a remote shell program as the transport.
Copying from the local system to an rsync server on the remote system, using a remote shell program as the transport.
Listing the files available on a remote system, which is accomplished by omitting the destination from the command line. For example:
```
rsync somehost.mydomain.com::
```
lists the available rsync modules on the specified system. Rsync modules and each of the various methods of using rsync are described in the second part of this article.

In the event a file exists on the target system, the rsync protocol is used to detect the differences and transmit only the changes between the files. The operation of rsync can be changed through a variety of options, which are detailed in the rsync manual pages.

This File, but Not That File

The file selection capabilities of rsync are very strong. The user can specify file name patterns to include in and exclude from the data transfer, which gives fine-grained control over which files are transferred and which are bypassed.

As rsync processes the filenames to transfer, it evaluates each file against the include and exclude patterns in order and acts on the first match. If the first match is an exclude pattern, the file is skipped. If the first pattern matched is an include pattern, the file is transferred.

If a pattern contains one of the shell wildcard characters (*?[), rsync applies shell-style filename matching; otherwise, a simple string match is used. Of special mention is the / character. If the pattern contains a / (not counting a trailing /), it is matched against the entire pathname. Otherwise, rsync looks only at the filename component and ignores the path to the file.

Additionally, if a pattern starts with a + followed by a space, the files named are included in the transfer, even if the pattern appears in an exclude list. Likewise, if a pattern starts with a - followed by a space, the files are excluded, even if the pattern appears in an include list. The + and - prefixes can therefore be used in both include and exclude lists. Since excluding files is the more common case, a single exclude list can hold both exclude and include patterns.
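
The first-match rule is easy to get wrong, so here is a small model of it in shell. It is only an illustration of the ordering logic described above, not rsync's matcher: it compares bare file names using the shell's own pattern matching, and the function name decide is hypothetical. Each rule is written with the "+ " or "- " prefix.

```shell
# decide NAME RULE... -> prints include or exclude.
# Each RULE is "+ PATTERN" or "- PATTERN"; the first matching rule wins.
decide() {
    local name=$1 rule pattern
    shift
    for rule in "$@"; do
        pattern=${rule#??}            # drop the two-character "+ " or "- " prefix
        case "$name" in
            $pattern)
                case "$rule" in
                    "+ "*) echo include ;;
                    *)     echo exclude ;;
                esac
                return 0
                ;;
        esac
    done
    echo include                      # no rule matched: the file is transferred
}

decide include.iso "+ include.iso" "- *.iso"   # prints: include
decide other.iso   "+ include.iso" "- *.iso"   # prints: exclude
decide include.iso "- *.iso" "+ include.iso"   # prints: exclude (order matters)
```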

Here is an example of using rsync in daemon mode, described in the second part of this series, to transfer all the files in the www module, excluding those files ending in ".iso":
```
rsync chare@alpha::www/ --recursive --include="include.iso" --exclude="*.iso" --verbose I:
```
The command shown above is used to copy all the files from an rsync module named "www" on the server alpha, as the user "chare". Each directory is traversed, and all files are copied from the remote system to the local system and stored in the directory "I:". The exclude directive prevents copying files with an extension of .iso, with the exception of the named file, include.iso. Because the first matching pattern wins, the include pattern for include.iso must come before the exclude pattern. This example illustrates using rsync from a Windows- or DOS-based system, which demonstrates the versatility of rsync.

Measuring Success

Because of the somewhat cryptic nature of the error messages in rsync, it may be a challenge to troubleshoot problems. For example, a very confusing error message is:
```
protocol version mismatch - is your shell clean?
```
This is typically caused by commands in the shell startup scripts on the remote system sending text. To determine whether remote connectivity is working as it should, execute the command:
```
rsh remotehost /bin/true > out.dat
```
or
```
ssh remotehost /bin/true > out.dat
```
depending upon the remote shell being used. If things are working properly, out.dat should be a zero-length file. If there is any text in the file, it should give some indication as to the problem. Once you have a zero-length file, rsync can use the shell transport successfully. If all else fails, try using the -vv option on the rsync command line as in:
```
rsync -vv -e ssh video2.avi chare@zrc2c01a:/home/chare/video2.avi
```
This will generate a significant quantity of statements to assist in debugging the problem.
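
The zero-length check described above can be wrapped in a small function for use in scripts. This is a sketch: the name transport_is_clean is my own, and remotehost in the usage comment is a placeholder. It runs whatever command it is given and succeeds only if that command writes nothing to standard output.

```shell
# transport_is_clean CMD... -> exit status 0 only if CMD prints nothing on stdout.
transport_is_clean() {
    local bytes
    bytes=$("$@" 2>/dev/null | wc -c | tr -d ' ')
    [ "$bytes" -eq 0 ]
}

# Usage against a real host (remotehost is a placeholder):
#   transport_is_clean ssh remotehost /bin/true || echo "startup scripts print text"
```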

Rsync also sets a series of exit codes allowing the administrator to determine whether rsync executed successfully in a script. These exit codes are listed in Table 2. Determining whether rsync executed correctly involves adding the appropriate code to your shell script to test the exit code upon completion. For example, the following Bourne/Korn shell commands could be used:
```
# Send both stdout and stderr to the log file (the redirection order matters).
rsync -vv -e ssh video2.avi chare@zrc2c01a:/home/chare/video2.avi >/tmp/output 2>&1
status=$?   # save the exit code before another command overwrites it
case "$status" in
   0)
      echo "rsync completed successfully"
      exit 0
      ;;
   *)
      echo "An error occurred.  Rsync returned $status"
      exit 1
      ;;
esac
```
This sample script executes rsync and evaluates the success based upon the exit or return code.
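
A script can also turn the numeric code into something readable in a log. The sketch below covers only a handful of codes, with descriptions paraphrased from the rsync manual page; the function name describe_rsync_status is my own, and the rsync documentation remains the authority for the full list.

```shell
# describe_rsync_status CODE -> prints a short description for a few exit codes.
describe_rsync_status() {
    case "$1" in
        0)  echo "success" ;;
        1)  echo "syntax or usage error" ;;
        2)  echo "protocol incompatibility" ;;
        10) echo "error in socket I/O" ;;
        12) echo "error in rsync protocol data stream" ;;
        23) echo "partial transfer due to error" ;;
        30) echo "timeout in data send/receive" ;;
        *)  echo "status $1 (see the rsync manual page)" ;;
    esac
}

describe_rsync_status 12   # prints: error in rsync protocol data stream
```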

Rsync and the Shell Environment

Like most applications today, rsync's operation can be changed using shell environment variables. The variables are not required, because many of them can also be set using command-line options to rsync, but I've included them here for completeness (see Table 3). They are also worth knowing about, since reliance upon environment variables can introduce security and operational problems if the variable contents are changed.
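
One way to guard against a changed environment is to check the relevant variables before rsync runs. The sketch below looks at RSYNC_RSH (the remote shell rsync will use) and RSYNC_PASSWORD (a password for the rsync server), both documented in the rsync manual page. The function name check_rsync_env is my own, and the policy that ssh is the only acceptable transport is an assumption made for the example.

```shell
# check_rsync_env -> prints one warning per risky setting, nothing if all is well.
# Assumption: this site expects ssh as the only remote shell transport.
check_rsync_env() {
    if [ -n "${RSYNC_RSH:-}" ] && [ "$RSYNC_RSH" != "ssh" ]; then
        echo "warning: RSYNC_RSH is set to '$RSYNC_RSH', expected ssh"
    fi
    if [ -n "${RSYNC_PASSWORD:-}" ]; then
        echo "warning: RSYNC_PASSWORD is set in the environment"
    fi
}

check_rsync_env
```

Passing -e ssh explicitly on the command line, as in the earlier examples, avoids depending on RSYNC_RSH at all.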

I have shown examples of executing rsync using the remote transport capabilities of rsh or ssh. Rsync also offers a strong server, including authentication and access control. However, before you can use the server, it must be configured, and I'll cover that in the next installment.

Summary

In the first of this two-part series on rsync, I presented what rsync is and why it is worthy of consideration as a file and data synchronization tool. I also presented the methods of using rsync to maintain data synchronization over both the rsh and ssh transports.

Rsync is intended for use in situations where it is important to keep the data between multiple systems synchronized, such as Web or FTP servers. Using rsync to keep two Web servers synchronized means changes can be made to one Web server and automatically transferred to the others.

In the next part of this series, I will present additional examples of rsync in action, access controls, configuring the rsync server, and security concerns with rsync, which the systems administrator and rsync user must be aware of to ensure secure transport and data synchronization.

Acknowledgements

I'd like to thank Mignona Cote, a trusted friend and colleague, for her support during the development of this article. Mignona continues to provide ideas and challenges in topic selection and application, always with an eye for practical application of the information gained. Her insight into system and application controls serves her and her team effectively on an ongoing basis.

Chris Hare has more than 18 years of experience in the computing industry, with positions ranging from application design, quality assurance, systems administration, and network analysis to security consulting. Chris is the co-author of New Riders Publishing's Inside Unix, Internet Firewalls and Network Security; Building an Internet Server with Linux; and The Internet Security Professional Reference. He lives in Dallas, Texas and is employed with Nortel Networks as an Information Security and Control Consultant. He can be reached at: chare@chris-hare.com.