Keeping
Data in Sync::rsync
Chris Hare
Data management is an ongoing issue that plagues many companies.
Organizations are constantly looking for ways to move data securely
between systems in an automated fashion, keeping file systems or
data files synchronized, or simply ensuring a group of systems have
common data.
Keeping files synchronized in the enterprise generally means copying
the file from one system to another. The assumption most organizations
make is the availability of high-speed, low-latency networks where
copying a large file is performed expediently. Although there are
commercial tools to perform this type of data movement, many organizations
cannot afford these tools or do not need the advanced functionality.
This article presents the open source tool rsync, including configuration
requirements, operational characteristics, and security implications.
This is the first of a two-part series examining rsync. In this
article, I will describe what rsync is, how it works, and the interface
between the client and the server. In the second part, I will cover
configuration of the rsync server, access control, and security
implications.
What Is rsync?
Rsync is essentially a replacement for the rcp (remote
copy) command currently available in Unix. However, unlike rcp,
rsync has many more capabilities, including the secure delivery
of files using Secure Shell (SSH) as the underlying transport.
Rsync uses the rsync protocol for some of the transport operations,
resulting in a more efficient and faster data transfer operation
when compared with other methods. In particular, the rsync remote-update
protocol provides specific benefits when the destination file already
exists, by restricting transmission of the file to only the changes
needed to keep the file in sync. If transmission of the changes
is equivalent to transferring the entire file, rsync copies the
entire file from one system to the other.
Rsync offers other distinct features, including:
- Support for copying links, devices, owners, groups, and permissions
- Exclude and exclude-from options similar to GNU tar
- A CVS exclude mode for ignoring the same files that CVS would
ignore
- Use of any transparent remote shell, including rsh or ssh
- No requirement for root privileges
- Pipelining of file transfers to minimize latency costs
- Support for anonymous or authenticated rsync servers
Rsync uses the rsync algorithm, providing a very fast method of
determining the differences between files and only sending the differences
to keep the fles synchronized. (See the technical paper at the rsync
Web site, http://samba.org/rsync, for more information on
the algorithm.) A principle advantage to the rsync protocol is the
ability to determine and transmit the differences without the need
for both files to be on the same system.
The rsync algorithm is highly efficient up to files that are 1
GB in size. Consequently, the benefits of using rsync over rcp and
rdist are notably in the area of bandwidth consumption. The principle
differences between rsync and other file synchronization tools are:
- Copying only the differences -- Only the actual changed parts
of the file are copied, rather than the whole file. This improves
update speed, especially over lower speed links. Applications
such as rcp, rdist, and FTP would transfer the entire file, even
if only one byte had changed.
- Compression -- The data to be transferred is compressed, further
improving file transfer speed and bandwidth utilization.
- Security -- Rsync can use several transport methods, including
SSH for improved security. The rsync data stream is passed through
the SSH tunnel, protecting the data. Use of a facility like ssh
is required because neither the rsync nor the rsh protocol provides
any real data security or protection capabilities.
The Pros and Cons
There are advantages and disadvantages to any given application.
Table 1 lists some of these as they relate to using rsync. One of
the noted disadvantages of rsync is its difficulty in processing
files more than 4 GB in size. The rsync development community is
working to address this and other elements of the rsync protocol
and implementation.
How rsync Works
Like rdist, rsync uses rsh for communications between two hosts.
This is the default configuration, although two other communication
methods are available, including using ssh or the rsync server.
Use of ssh allows for key-based authentication between the systems,
eliminating the need for the use of .rhosts, hosts.equiv files,
and better password management. Using ssh provides a higher degree
of security both through improved authentication capabilities and
the use of an encrypted transport. The configuration of ssh is beyond
the scope of this article.
Unlike rdist, however, rsync does not require root privileges,
nor must it run with setuid or other extended privileges. Rsync
does require a functioning communication path using either rsh or
ssh between the systems. Again, security enhancements are gained
by using ssh as the transport over rsh. The rsync protocol operates
both as a standalone service and as the network daemon using the
ssh transport.
Rsync can work by performing file exchange and updates between
two systems, or by running in a server mode where it can listen
on a network socket and provide file distribution services. When
operating in distribution mode or using the rsync server, authentication
and access control capabilities are available. While the emphasis
of this discussion is moving data between two systems, it is also
possible to use rsync to send data to multiple systems -- as many
as needed.
When keeping files synchronized, the easiest method is to simply
copy the file from one system to another. However, if the network
link is a low-bandwidth, high-latency circuit like dial-up IP, or
the file is very large, copying the entire file can take a long
time. The rsync technical white paper available on the rsync Web
site illustrates the transfer times for the various protocols.
Rsync achieves high throughput and effective file transfers through
the use of an aggressive rolling checksum capable of finding the
differences in the file. Even with protocol overhead and checksums,
it is less expensive in terms of bandwidth/time to rsync two files
than transfer them.
Why Use rsync?
Many commercial products provide file synchronization, including
CrossWorlds, DDS, and IBM MQSeries. However, commercial products
incur licensing and maintenance costs that some organizations may
want to avoid. The rsync solution provides smaller companies and
even home-based businesses and individual users with the ability
to synchronize data or back up files.
For example, Web or FTP administrators who need to maintain mirrors
of their Web or FTP sites can do so using rsync without worrying
about excessive bandwidth. This provides for redundant or load-balanced
sites, as well as capacity and contingency planning. Additionally,
rsync can be used to allow public file distribution using a non-FTP
transport.
Since there are rsync clients and servers for both Unix and Microsoft
Windows, a PC user can use rsync to keep data synchronized on an
alternate system for data recovery purposes.
Using rsync
Using rsync is as simple as executing a one-line command to initiate
the transport. However, before rsync will work, you must have a
functioning rsh or ssh connection to the remote server. For example,
to test the existence of an rsh connection to the target system,
execute the following command:
[chare@gw chare]$ rsh -l chare localhost date
Permission denied.
If the connection to the remote system is successful, then rsync will
work. If not, then the rsh connectivity must be corrected. (An error
such as "Permission denied" generally means the target system doesn't
have a .rhosts file for the user. This is one of the major security
risks associated with the Berkeley r* commands.) Note that it is not
required for rsync to use a .rhosts file unless the intent is to use
the rsh protocol, which has inherent security issues.
Once a command can be executed on the local system using rsh or
ssh, rsync is ready for use:
[chare@gw chare]$ rsh localhost -l chare date
Thu Feb 27 15:11:20 CST 2003
[chare@gw chare]$
Alternatively, ssh can be used as the transport mechanism to provide
certificate-based authentication and an encrypted transport. Use of
the encrypted transport is especially important if data must be transported
across the Internet. Even if the data transport is across a private
network, using ssh is the preferred method. To test your ssh connectivity
to the remote server, execute the command:
[chare@gw chare]$ ssh localhost date
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is cc:f8:cc:90:8a:04:2f:24:88:4f:6e:e1:e3:a1:62:e6.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
chare@localhost's password:
Thu Feb 27 15:15:35 CST 2003
[chare@gw chare]$
If the connection to the remote system is successful, rsync is ready
for use. If the ssh connection requires a password (as shown in the
example) and rsync is going to be used in an automated process for
the file transfers, a host-based certificate with no passphrase is
required. Consult the ssh documentation for the correct procedure.
With the base connectivity to the remote system functioning, rsync
is executed using a single command:
[chare@zrchy0je chare]$ rsync --verbose -e ssh video2.avi \
chare@zrc2c01a:/home/chare/video2.avi
chare@zrc2c01a's password:
video2.avi
wrote 778656 bytes read 36 bytes 74161.14 bytes/sec
total size is 778483 speedup is 1.00
[chare@zrchy0je chare]$
In the example, rsync is used to transfer the file video2.avi from
the local system to the system named zrc2c01a. The -verbose
option tells rsync to provide more information about the connection
than normal. The -e option, which is followed by the ssh command,
establishes ssh as the transport rather than rsh. (This option can
be specified other ways as discussed later.) The next argument is
the source file (video2.avi) followed by the destination. The destination
file in this case consists of:
- A user name (chare)
- A target host (zrc2c01a)
- A colon (:)
- The target directory and or filename (/home/chare/video2.avi)
The colon is particularly important in the command execution.
If there is only one colon, the ssh or rsh transport is used. If
two colons are used (::), a connection to the rsync server on the
remote system is attempted.
Here is a simple example of using rsync to back up a directory
from one Unix server to another using ssh:
[chare@gw chare]$ rsync --verbose -e ssh -r work localhost:/tmp/
chare@localhost's password:
building file list ... done
work/393.zip
work/394.zip
work/395.zip
work/398.zip
wrote 27703908 bytes read 84 bytes 2052147.56 bytes/sec
total size is 27700270 speedup is 1.00
[chare@gw chare]$
The -r option instructs rsync to recursively copy the files
from the work directory to the remote host.
A major benefit of rsync is its ability to copy only the data
it needs to copy. For example, if the 393.zip changes at some point,
when the same command is executed again, this is the result:
[chare@gw chare]$ rsync --verbose -e ssh -r work localhost:/tmp/
chare@localhost's password:
building file list ... done
work/393.zip
work/394.zip
work/395.zip
work/398.zip
wrote 140424 bytes read 145836 bytes 19742.07 bytes/sec
total size is 26760919 speedup is 93.48
[chare@gw chare]$
In this example, information from the file 393.zip had been removed,
and the rsync protocol transmitted only the parts of the file that
were necessary. This is evident in the difference between the number
of bytes read versus the number of bytes written to the file. Note
that, unfortunately, the rsync client doesn't make the number of bytes
transferred for each file very clear unless the -progress option
is used.
There are several different ways to use rsync. These are:
- Copying the files on the local system. This occurs when neither
the source nor destination files include a colon in the name.
For example:
rsync /tmp/work /home/chare/work
behaves like an improved copy command moving only the files that
have changed from /tmp/work to the user's directory.
- Copying files from the local to a remote system using the rsh
or ssh transports. This is invoked when either source or destination
includes a colon in the pathname. For example:
rsync *.c target:source/
copies all files matching the pattern *.c from the current directory
to a directory called "source" on the target system.
- Copying files from a remote to the local system using the rsh
or ssh transports. This is invoked when either source or destination
includes a colon in the pathname. For example:
rsync -r target:data/ /data/tmp
recursively copies the files in the directory data from the remote
system target to a directory called /data/tmp on the local system.
The -r option is required for a directory to be processed.
- Copying files from a remote rsync server when there is double
colon (::) is the source filename, or a URL of rsync://filename.
- Copying files from the local system to an rsync server when
there is a double colon (::) in the destination filename or a
URL of rsync://filename.
- Copying from the remote to local system using a remote shell
program as the transport from an rsync server on the remote system.
- Copying from the local to remote system using a remote shell
program as the transport from an rsync server on the remote system.
- Listing the files available on a remote system, which is accomplished
by omitting the destination from the command line. For example:
rsync somehost.mydomain.com::
lists the available rsync modules on the specified system. Rsync
modules and each of the various methods of using rsync are described
in the second part of this article.
In the event a file exists on the target system, the rsync
protocol is used to detect the differences and transmit only
the changes between the files. The operation of rsync can be
changed through a variety of options, which are detailed in
the rsync manual pages.
This File, but Not That File
The file selection capabilities of rsync are very strong.
The user can specify file name patterns to include and exclude
from the data transfer. The use of the strong file selection
capabilities provides highly flexible file selection to determine
which files to transfer and which ones to bypass.
When the filenames to transfer are processed from the command
line, rsync evaluates the file against the include and exclude
patterns and processes the first match. If the first match is
an exclude pattern, the file is skipped. If the first pattern
matched is an include pattern, the file is transferred.
The use of standard shell meta characters for filename globbing
(*?[) can be used to generate the list of files by the shell,
otherwise a simple string search is used to match the file names.
Of special mention is the / character. If the filename
to transfer includes a trailing /, the filename is matched against
the entire pathname. Otherwise, rsync only looks at the filename
component and ignores the path to the file.
Additionally, if the pattern starts with a + followed
by a space, the files named are included to be transferred,
even if they would normally be excluded. The same is true for
the - character followed by a space, the file will be
excluded, even if it would normally be included. It is important
to consider that + and - characters can be used
in both include and exclude lists. Since it is more likely to
exclude files, the + and - can be used in a single
exclude list, which has both exclude and include patterns.
Here is an example of using rsync in daemon mode, described
later in the article, to transfer all the files in the www module,
excluding those files ending in "iso":
rsync chare@alpha::www/ --recursive --exclude="*.iso +include.iso" --verbose I:
The command shown above is used to copy all the files an rsync
module named "www" on the server alpha, as the user "chare". Each
directory is traversed, and all files are copied from the remote
system to the local system and stored in the directory "I:". The
exclusion directive prevents copying files with an extension of
.iso, with the exception of the named file, include.iso. This
example illustrates using rsync from a Windows- or DOS-based system,
which demonstrates the versatility of rsync.
Measuring Success
Because of the somewhat cryptic nature of the error messages
in rsync, it may be a challenge to troubleshoot problems. For
example, a very confusing error message is:
protocol version mismatch - is your shell clean?
This is typically caused by commands in the shell startup scripts
on the remote system sending text. To determine whether remote
connectivity is working as it should, execute the command:
rsh remotehost /bin/true > out.dat
or
ssh remotehost /bin/true > out.dat
depending upon the shell being used. If things are working properly,
out.dat should be a zero-length file. If there is any text in
the file, it should give some indication as to the problem. Once
you have a zero-length file, rsync can use the shell transport
successfully. If all else fails, try using the -vv option
on the rsync command line as in:
rsync --vv -e ssh video2.avi
This will generate a significant quantity of statements to assist
in debugging the problem.
Rsync also sets a series of exit codes allowing the administrator
to determine whether rsync executed successfully in a script.
These exit codes are listed in Table 2. Determining whether
rsync executed correctly involves adding the appropriate code
to your shell script to test the exit code upon completion.
For example, the following Bourne/Korn shell commands could
be used:
rsync --vv -e ssh video2.avi 2>&1 >/tmp/output
case "$?" in
0)
echo "rsync completed successfully"
exit 0
;;
*)
echo "An error occurred. Rsync returned $?"
exit 1
;;
esac
This sample script executes rsync and evaluates the success based
upon the exit or return code.
Rsync and the Shell Environment
Like most applications today, rsync functionality or operation
can be changed using shell environment variables. While the
variables are not required because many of the environment variables
can also be designated using command-line options to rsync,
I've included them here for completeness (see Table 3). This
function is also important, since reliance upon environment
variables can introduce security and operational problems if
the variable contents are changed.
I have shown examples of executing rsync using the remote
transport capabilities of rsh or ssh. Rsnyc also offers a strong
server, including authentication and access control. However,
before you can use the server, it must be configured, and I'll
cover that in the next installment.
Summary
In the first of this two-part series on rsync, I presented
what rsync is and why it is worthy of consideration as a file
and data synchronization tool. I also presented the methods
of using rsync to maintain data synchronization and using both
the rsh and ssh transports.
Rsync is intended for use in situations where it is important
to keep the data between multiple systems synchronized, such
as Web or FTP servers. Using rsync to keep two Web servers synchronized
means changes can be made to one Web server and automatically
transferred to the others.
In the next part of this series, I will present additional
examples of rsync in action, access controls, configuring the
rsync server, and security concerns with rsync, which the systems
administrator and rsync user must be aware of to ensure secure
transport and data synchronization.
Acknowledgements
I'd like to thank Mignona Cote, a trusted friend and colleague,
for her support during the development of this article. Mignona
continues to provide ideas and challenges in topic selection
and application, always with an eye for practical application
of the information gained. Her insight into system and application
controls serves her and her team effectively on an ongoing basis.
Chris Hare has more than 18 years experience in the computing
industry with positions ranging from application design, quality
assurance, systems administration, network analysis, and security
consulting. Chris is the co-author of New Riders Publishing's
Inside Unix, Internet Firewalls and Network Security; Building
an Internet Server with Linux; and The Internet Security
Professional Reference. He lives in Dallas, Texas and is
employed with Nortel Networks as an Information Security and
Control Consultant. He can be reached at: chare@chris-hare.com. |