Taming the Distributed Database Problem: A Case Study Using MySQL
Giuseppe Maxia
There are situations where data belonging to a single context is
nonetheless scattered among different servers, possibly in separate
locations. In such situations, gathering the data together is harder
than dealing with a unified set of tables that all live under the
same engine. The challenge is purely technological. It is easy to
create a virtual file system that makes the database engine believe
all the files are on the same machine, even though some of them are
just links across the network, perhaps hundreds of miles away. In
such a setup, however, performance would be so poor that the whole
purpose of having a DBMS would be defeated. Hence the field of
distributed database research, which tries to find clever ways of
closing the technological gap and making normal operations look,
if not routine, at least affordable.
Distributed database theoreticians often debate the feasibility
of performing JOIN operations across distant databases. There are
some astute methods to achieve this goal, sometimes with acceptable
performance. A JOIN between two tables in different databases, with
a WHERE clause involving both systems, can be carried out quite
quickly. But things are not always that easy. Some classes of queries
require so much interaction between tables that their cost, when
carried out remotely, becomes prohibitive. As an example, consider
a query that finds which rows are different in two given tables,
or which rows are missing from one of them. Performing such a task
in a distributed database can be a challenging exercise. In this
article, I examine exactly this kind of query and explore some
alternative ways of performing remote comparison.
Remote Comparison
Comparing tables that belong to different databases implies examining
the entities at three levels:
- Table structure
- Amount of data
- Data contents
The importance of the first two levels should be straightforward.
It is useless to compare data if we are not sure about the containers
(i.e., the tables). If the tables have a different structure, any
difference I may find will be meaningless. The same goes for the
quantity of data available. If I use some statistical methods to
compare the data in two tables, they must have the same number of
records for the results to be of any use.
Fortunately, the first two levels are also the easiest to deal
with. Comparing two table structures with the features that any
DBMS offers is fast and accurate. There are already specialized
tools that perform such tasks efficiently, and the row counting
part is trivial.
Thus, I am left with the third level -- the one that troubles
me most. Before getting into more detail, let me first explain how
I got into this perilous field.
Background
Until a few months ago, I was in charge of developing and maintaining
a large human resources database. It was a work in progress, being
migrated from a legacy system while new features were developed
constantly, so I found it useful to install two servers: one for
development, and one for production (Figure 1).
My first encounter with remote comparison came when the new features
on the development server needed some adjustments to their structure.
Before committing the changes to the production server, I tested
the new features on the development one, and I kept a log of the
changes, which I then replayed on the main server. There was no
problem with this strategy until the day I missed an item in the
log file and thus failed to update the main server correctly.
A few weeks later, with the whole issue forgotten, it took me several
hours to nail down the resulting bug. In the process, I found plenty
of tools to compare the structure of two databases, and even to
automatically produce the ALTER TABLE statements needed to fill the gaps.
But the problems were not that simple. Besides the development/production
duality, I also had a data warehouse to facilitate heavy number
crunching for the statistics that the management required on a daily
basis. Because the data warehouse needed development, migration,
and maintenance, it was also duplicated. This was no big deal at
the beginning, but data warehouses naturally grow large very quickly.
So I had to devise a mechanism to populate each of them within their
respective servers.
Everything went well and the data warehouse produced statistics
that never got more than a distracted glance, until one day, 30
minutes before a big meeting, somebody realized that there was a
.02% inconsistency in the statistics compared to the previous month.
So, of course, I was tasked with fixing it.
I compared the numbers in the development server with the main
one and found that the numbers in the test server were correct.
I printed the statistics with 19 minutes to spare, but after everyone
had left I was still there trying to find out the difference between
two huge tables in two different servers. Once all the users had
logged out, I copied the tables from the development to the production
server, which kept the servers busy for the better part of 3 hours,
and then compared the two tables locally and found the culprit.
This incident led me to more serious thinking about remote comparison,
and here are the results. Remote comparison is also useful for purposes
such as establishing a mirror database or checking the correctness
of a replication engine. You never know when such a need will arise
-- unless, of course, you are developing a distributed database
system -- and therefore these notes may turn out to be useful.
Brute Force Attacks
I used a brute force approach on my data warehouse (Figure 2a),
which is not recommended, because it transfers all the records from
one server to another. So, if your table has one million records,
you move all of them across the network. Additionally, the receiving
server might have to recreate the tables from raw data, depending
on the method you choose to do the transfer.
Another common approach, though, may be even worse. When this
problem pops up in a programming forum, usually somebody suggests
querying both servers and comparing the data sets in the client
(Figure 2b). What they don't realize is that they are moving two
million records across the network, making two unhappy DBAs and
an unhappy sys admin in the bargain.
Comparing Other Things
There is something to learn from the methods used to compare things
other than database tables. For instance, let's see how to compare
two binary files.
Short of transferring the whole file from one computer to the other,
there is a commonly used procedure to make sure that a file on your
machine is the same as one on a distant server. You see it, for
example, when you download a large file, such as a CD ISO image,
from a Web site. Besides the file, the site may publish the MD5
signature of the file (a 128-bit number), usually given as a
32-character hexadecimal string. Once you have downloaded the file,
you can calculate its MD5 locally and compare the result with the
one on the site. If they match, you know that the download
was successful. The benefit of this method is that you can compare
a small amount of data that represents the real stuff instead of
sending large chunks of bytes across a network to perform a comparison.
A similar approach can be used to compare a single row of a
database table. You can either send all the row's fields and
compare them one by one, or you can calculate some sort of CRC for
the row data and compare just that. Once more, the MD5 function is
a good choice for the task:
SELECT MD5(CONCAT(field1, field2, field3, field4)) as signature from MYTABLE;
This query returns a 32-character string for each row -- quite
convenient if the fields in the table include some large BLOBs. The
comparison is then trivial for any client.
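For instance, to fetch the signature of one specific row, the same
idea can be restricted with a WHERE clause (a sketch; the table,
columns, and id value are only illustrative, and CONCAT_WS is used
so that a NULL field does not wipe out the whole string):

SELECT MD5(CONCAT_WS('/', field1, field2, field3, field4)) AS signature
FROM MYTABLE
WHERE id = 12345;

Run the same statement on both servers and compare the two strings
in the client.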
However, even this knowledge does not lead to a complete solution.
How can I calculate a CRC for a whole table instead of one row?
There is no ready-made answer, except, perhaps, applying the file
CRC method to the underlying database files. It would be a mistake,
though. Even if all the DBMS engines used a single file for each
table -- and that is not true -- there is no guarantee that the
same data is stored in the same way. Too many factors determine
the file format, from the file system to the paging size, the order
in which the records were stored, and so on. A binary comparison
for a table would be unreliable.
Determining Whether Two Tables are Different
The CRC approach, while misleading if applied to a file, can
still be used for remote comparison, provided that you think outside
the box. Here's an example. Say you want to know whether two tables
differ at all. This information is useful when you only need to
know whether a copying mechanism is working properly, such as when
monitoring a replication setup. You don't need to know what is
different; the mere fact that there is a divergence should ring a
bell for the DBA. So, here goes:
SELECT
  SUM(
    CONV(
      SUBSTRING(
        @CRC := MD5(CONCAT_WS('/', field1, field2, field3 ...)),
        1, 8),
      16, 10)) AS sig1,
  SUM(CONV(SUBSTRING(@CRC,  9, 8), 16, 10)) AS sig2,
  SUM(CONV(SUBSTRING(@CRC, 17, 8), 16, 10)) AS sig3,
  SUM(CONV(SUBSTRING(@CRC, 25, 8), 16, 10)) AS sig4
FROM
  mytable;
What I get from this query is a set of four numbers, which together
represent the whole table, just as a CRC can represent one row.
Let's walk through the innards of this statement, which will come
in handy very soon. Since the functions are nested, we'll start from
the innermost ones and work outwards.
CONCAT_WS is a function that merges all its arguments, separated
by the string given as its first argument. To the resulting string,
I apply the MD5 function, which gives me a string of 32 characters.
That string is then assigned to the user variable @CRC. In MySQL,
a user variable is something to which you can assign a value and
then use later in the same query, or in a different one within the
same session. One convenient property here is that the assignment
is evaluated only once per row. Thus, I assign the result of MD5
to @CRC to avoid recalculating it for each substring, as the short
example below illustrates.
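For instance, the following statement (a quick sketch against the
employees table introduced later in the article) assigns the MD5 of
a value to @t and then reuses it twice within the same row, which is
the same trick the query above relies on:

SELECT
  employee_code,
  @t := MD5(name)      AS row_crc,
  SUBSTRING(@t, 1, 8)  AS first_chunk,
  SUBSTRING(@t, 25, 8) AS last_chunk
FROM employees
LIMIT 3;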
Moving outwards, the next function to be evaluated is SUBSTRING.
I extract the first eight characters from @CRC and pass them to
the CONV function, which converts the string from base 16 to base
10. Why? I need to perform an arithmetic operation (SUM) on the MD5
result, but I can't do that directly because the MD5 result is a
string, not a number. So, why didn't I pass @CRC directly to CONV?
Why did I have to split it into four pieces? There are two reasons.
The first is that CONV works with 64-bit integers; the full output
of MD5 would be truncated. The second is that, even if I could
convert the whole 128-bit number, summing such values over many
rows would have a fair chance of overflowing.
Once I work around these limitations of the DBMS, I get four fields,
each one a portion of the MD5 calculation, summed over the entire
table. If this query produces the same results in both databases,
then the two tables have the same contents. A mismatch in any of
the fields indicates that at least one field in one row has a
different value. A handy simplification is to wrap the four sums
in a CONCAT function, so that you get a single result at the price
of a little additional CPU load on the server. Listing 1 shows an
example Perl script to compare two remote tables.
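In outline, such a script could look like this (a minimal sketch
using Perl DBI; the connection parameters, table name, and column
list are placeholders):

use strict;
use warnings;
use DBI;

# the four sums described above, wrapped in CONCAT to get one string
my $crc_query = q{
    SELECT CONCAT(
      SUM(CONV(SUBSTRING(@CRC := MD5(CONCAT_WS('/', field1, field2, field3)), 1, 8), 16, 10)), '-',
      SUM(CONV(SUBSTRING(@CRC,  9, 8), 16, 10)), '-',
      SUM(CONV(SUBSTRING(@CRC, 17, 8), 16, 10)), '-',
      SUM(CONV(SUBSTRING(@CRC, 25, 8), 16, 10)))
    FROM mytable
};

# connect to both servers (DSNs and credentials are placeholders)
my $dbh1 = DBI->connect('dbi:mysql:db1;host=server1', 'user', 'pass', { RaiseError => 1 });
my $dbh2 = DBI->connect('dbi:mysql:db2;host=server2', 'user', 'pass', { RaiseError => 1 });

# fetch one signature per server and compare them in the client
my ($sig1) = $dbh1->selectrow_array($crc_query);
my ($sig2) = $dbh2->selectrow_array($crc_query);
print $sig1 eq $sig2 ? "tables match\n" : "tables differ\n";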
A Clever but Unfocused Solution
By now I have all the elements needed to get to the solution, and
I would like to show you how I eventually arrived at it. The path
involved some mistakes, as usual. So let me walk you through an
incorrect solution first, because it contains some very useful ideas.
I know that I can get a CRC for one row, and that I can sum over an
entire table and get a global CRC. But how can I apply this knowledge
to find which rows are actually different? I know the two servers
can't interact in this respect without exchanging an unacceptably
large amount of data, so I must apply some sort of divide-and-conquer
algorithm to refine the method and get the CRC of a range of rows.
This should not be too hard: it should be enough to use the query
seen above, filtered with a WHERE clause. Shouldn't it? Theoretically,
yes. It is doable, and it would be convenient if I had to run the
query only once. But to be of any use, the query must narrow down
the range (or ranges) where the differences lie, and therefore it
must run several times, with decreasing ranges, until the goal is
achieved.
To avoid recalculating an expensive CRC over the source rows at
every step, I create a shadow table holding only the minimum
information needed to drill down the search. Given a source table
"employees", with the definition in Figure 3, the shadow table
would have the following structure:
CREATE TABLE shadow_table (
    signature CHAR(32) NOT NULL PRIMARY KEY,
    sig1 BIGINT NOT NULL,
    sig2 BIGINT NOT NULL,
    sig3 BIGINT NOT NULL,
    sig4 BIGINT NOT NULL,
    PK INT NOT NULL,
    UNIQUE KEY PK (PK));
And I would fill the table with one query:
INSERT INTO shadow_table
SELECT
@CRC:=MD5(
CONCAT_WS('/',employee_code,gender,name,DOB,salary,dept_code,loc_code)),
CONV(SUBSTRING(@CRC, 1,8),16,10),
CONV(SUBSTRING(@CRC, 9,8),16,10),
CONV(SUBSTRING(@CRC,17,8),16,10),
CONV(SUBSTRING(@CRC,25,8),16,10),
employee_code
FROM employees;
The significance of this query should be clear from the previous explanation.
Now I need a sensible algorithm to determine which rows are
different. After some thinking, I decided to apply a binary search,
an established routine with a well-deserved reputation for
efficiency.
Here is the outline. To begin, I test the whole table to see whether
there is any difference at all. With the shadow table, the query
is much nicer and faster:
SELECT
SUM(sig1), SUM(sig2), SUM(sig3), SUM(sig4)
FROM
shadow_table;
As for the binary search itself, its implementation is a little trickier
than a normal search, due to the possibility that there is more than
one difference between the two tables. Assuming that the tables have
one thousand rows each, and there are two different records, at ID
750 and 900, the search goes like this:
1. Check the CRC of the ranges 1..512 and 513..1024 (these ranges
work better if they are powers of two):
SELECT
SUM(sig1), SUM(sig2), SUM(sig3), SUM(sig4)
FROM
shadow_table
WHERE
PK BETWEEN 1 AND 512;
From this, I get one set of sums per range from each server, which
I compare in the client.
2. For the first range, I have a match, meaning that there is
no difference in the first 512 rows. The values for the second range
differ, so I split it again and perform another search.
3. In the next comparison, I get a mismatch in both halves (513-768,
769-1024), meaning that each contains at least one difference. I
keep both ranges and split further. Here the algorithm starts to
differ from a common binary search, where only one half would be
retained.
4. Splitting again and again, I narrow down the ranges until
eventually I get the one-row ranges that point me to the differing
rows (Figure 4).
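In Perl, the drill-down could be sketched as a recursive routine
(the get_sums() helper is hypothetical: it would run the SUM query
shown above against one server, restricted with WHERE PK BETWEEN
the two bounds, and return the four sums as a single string):

# collect the primary keys of differing rows into @$found
sub find_diff_rows {
    my ($dbh1, $dbh2, $lo, $hi, $found) = @_;
    my $sums1 = get_sums($dbh1, $lo, $hi);
    my $sums2 = get_sums($dbh2, $lo, $hi);
    return if $sums1 eq $sums2;   # this range is identical on both servers
    if ($lo == $hi) {             # narrowed down to a single row
        push @$found, $lo;
        return;
    }
    # otherwise split the range and examine both halves
    my $mid = int(($lo + $hi) / 2);
    find_diff_rows($dbh1, $dbh2, $lo,      $mid, $found);
    find_diff_rows($dbh1, $dbh2, $mid + 1, $hi,  $found);
}

# usage (the database handles and the key range are placeholders)
my @diffs;
find_diff_rows($dbh1, $dbh2, 1, 1024, \@diffs);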
It took ten queries for each server to get the result, and I was
only transferring tiny portions of data across the network. For
a table of one million records, finding a difference would require
about twenty queries. It seems that I struck gold, but unfortunately,
it does not always work like that.
There are best cases, such as one difference in one million, and
there are worst cases, say one difference every one hundred records.
In the worst cases, this search performs so badly that you may wish
you had used brute force instead.
What then? The principle seemed right, but the implementation
was poor. So I resorted to Lasker's principle. Emanuel Lasker was
a chess champion about a century ago, and he left a lasting impression
on chess philosophy. Among other, more technical teachings, he left
an aphorism that I try to apply to several fields, not just chess:
"If you see the good move, look for the better one." This idea made
me think harder about a better solution.
Refactoring
I was traveling by train, preparing a speech I was giving at the
MySQL Users' Conference. Train trips can work wonders for your thoughts,
if you offer yourself completely to them, without the distraction
of driving. So there I was, thinking about database performance
to the background lullaby of the rattling train, when the idea came
to me. I admitted to myself that my previous reasoning was faulty:
I was making the server do what a client would do, instead of
exploiting the intrinsic power of the database engine.
After this bout of self-flagellation, I analyzed the algorithm
again and found its weak points. Besides mimicking typical client
behavior, the binary search with CRCs has the disadvantage that the
same record is taken into account several times, leading to a large
amount of unnecessary calculation. For instance, record number 750
was hashed first in the whole-table check, then again in the range
513-1024, then in the range 513-768, and so on. This just did not
sound right. I was wasting resources and needed a different approach
to minimize the calculation and make the task more efficient.
So, here is the new idea. I split the table into ranges in an
arbitrary way, just once, and leave the search within each range to
the database engine. To do this, I modify the structure of the
shadow table: I remove the sigX fields and instead add a column for
grouping purposes:
CREATE TABLE shadow_table (
    signature CHAR(32) NOT NULL PRIMARY KEY,
    PK BIGINT NOT NULL,
    span BIGINT UNSIGNED NOT NULL,
    UNIQUE KEY PK (PK),
    KEY span (span));
The span column holds a value, derived from the primary key, that
the main query will use to group the range CRCs. Filling the table
is faster than before:
INSERT INTO
    shadow_table (signature, PK, span)
SELECT
    MD5(CONCAT_WS('/',
        employee_code, gender, name, DOB, salary, dept_code, loc_code)),
    employee_code,
    FLOOR(employee_code / 1024)
FROM
    employees;
The filling query only calculates the CRC for each row; the numbers
passed to SUM are generated on the fly in the main query:
SELECT
span, COUNT(*) as cnt,
CONCAT(SUM(CONV(SUBSTRING(signature,1,8),16,10)),
SUM(CONV(SUBSTRING(signature,9,8),16,10)),
SUM(CONV(SUBSTRING(signature,17,8),16,10)),
SUM(CONV(SUBSTRING(signature,25,8),16,10))) as sig
FROM
shadow_table
GROUP BY
span;
See Listing 2.
Compared to the binary search, the first query returns many more
rows than before. For one million records, I get about one thousand
rows from each server, and then I compare the result sets in the
client. But I don't have to do any more calculations.
In the client, for each range that does not match, I run two more
queries. The query to the first server asks for all the signatures
in that given range. To the second server, I send a special query
asking it to return all the rows that don't match the signatures
from the first server (Figure 5):
# first server
SELECT signature FROM shadow_table WHERE span = XX;
# I get signatures "A2F", "B4E", "C1D", etc
# second server
SELECT
gender, name, DOB, salary
FROM
shadow_table
INNER JOIN employees ON (PK=employee_code)
WHERE
span = XX
AND signature NOT IN ("A2F", "B4E", "C1D", ... );
The second query would be a nightmare to create by hand, but
fortunately it is a breeze to build with some handy Perl idioms:
# $sth1 has just fetched the signatures for span $_ from the first server
my $columns = $sth1->fetchall_arrayref();
# build the query for the second server ($_ holds the span being examined)
my $query = qq{SELECT PK FROM shadow_table WHERE span = $_ AND signature NOT IN (}
          . join(",", map {"'$_->[0]'"} @$columns)
          . ")";
See Listing 3.
The whole process is much less painful than the binary search.
When I first tried a comparison with this method, it was so much
faster I thought that I had blundered somewhere or that perhaps
my solution was just an illusion. But a few robust tests confirmed
that not only does this method work well, but it also runs from
two to five times faster than the binary search.
Limits
This algorithm does not cover every possible case. First, for
clarity and for space reasons, I applied it only to tables with a
numeric, non-composite primary key. The algorithm can be adapted to
other cases as well, but explaining them would take considerably
more room. Second, this algorithm does not catch the case where a
record has been deleted from one of the tables. Fortunately, this
side problem is easily handled by the method shown in the next
section.
Finding Missing Records
Before venturing into the full table comparison, it is good practice
to check whether the tables have the same number of records and,
if they don't, to find which ones are missing. Finding missing
records is a less demanding task than finding differences, because
the operation at stake is just counting -- a feature that is built
into the database engine and performed quite efficiently.
The method works much like the one used for differences. On each
server, run a query that counts records per range, then compare
the lists in the client and build a further query to find the
individual missing rows. Because I am counting records (instead of
comparing CRCs), the queries must follow a specific order: when
comparing the two sets, I take the range with the lower count,
fetch the primary keys it contains from that server, then ask the
other server to return the records in that range whose primary keys
are not in the list (Listing 4).
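In outline, the queries could look like this (a sketch based on the
employees table from the earlier examples; XX stands for the range
under examination, and the key list in the last query comes from
the previous one):

# on each server: count the records per range
SELECT FLOOR(employee_code / 1024) AS span, COUNT(*) AS cnt
FROM employees
GROUP BY span;

# on the server with the lower count for range XX: list its keys
SELECT employee_code FROM employees
WHERE FLOOR(employee_code / 1024) = XX;

# on the other server: the rows it holds that the first one is missing
SELECT * FROM employees
WHERE FLOOR(employee_code / 1024) = XX
AND employee_code NOT IN ( ... );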
By combining the difference and the counting methods, I should
be able to catch any difference between two remote tables. There
are two caveats:
1. Choosing an unsound grouping column or expression may lead
to incorrect results when both tables have deleted records. The
best candidate for this kind of grouping is a single-column numerical
primary key divided by a fixed number. My favorite is FLOOR(PKcolumn
/ 1024).
2. Making ranges too big may produce a query larger than the
maximum packet size that MySQL allows. For example, with a range
of 100,000 rows, the query sent to the second server would be
roughly 600 KB for the counting method and about 3.5 MB for the
differencing method (one 32-character signature, plus quotes and a
comma, times the number of rows in the range). Because the default
packet size is 1 MB, such a query may be refused; you can check the
current limit as shown below.
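On each server, the limit currently in force can be checked with:

SHOW VARIABLES LIKE 'max_allowed_packet';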
Some Speculation on CRC Effectiveness
A CRC condenses information into a fixed-size representation. For
the purpose of this exercise, it seems to be up to the task. By
representing a whole record with a CRC string, I can reduce the
amount of information that needs to be exchanged between the two
servers. There is a conceptual downside, though. Although
a record is a reasonably small amount of data, a range of 1000 records
is a fairly large chunk of bits. When I was squeezing CRCs into
a simple sum, I had the feeling that I was reducing my capability
of properly describing the data. If I'd had a proper CRC function
that I could apply to a range of records, I would not have been
very concerned. After some brainstorming with a few IT enthusiasts,
I concluded that there was at least the theoretical possibility
of a collision when using a sum to replace a proper CRC.
The reasoning goes like this. I have N numbers, which when summed
give a result. It is possible to find another set of N numbers that
sum up to the same result. For example, given the total 9, I can
reach that result by summing up 5+4, 6+3, 7+2, 8+1, 9+0. Transferring
the calculation to the kind of numbers that I am using in this article,
how often can I expect such a collision? This thought was bugging
me for a while, so I calculated its probability.
Using 128-bit numbers, the probability that a different set of
values adds up to exactly the same result is about 1 in 2^128,
that is, .000000000000000000000000000000000000002938 (roughly 2.9e-39).
By comparison, the probability of winning first prize in the U.S.
lottery is about .00000001 (1.0e-08). That string of zeros
considerably reassured me about the unlikelihood of a collision.
Therefore, for the moment, I assume that the method is reasonably
safe.
A Viable Alternative Using Stored Procedures
Using the freshly released stored procedures available in MySQL
version 5.0.0, an alternative solution would be to calculate a more
robust CRC with a cursor:
CREATE PROCEDURE rangeCRC()
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE tmpCRC CHAR(32);
    # cursor declarations must follow variable declarations;
    # ORDER BY keeps the chained CRC deterministic on both servers
    DECLARE c1 CURSOR FOR
        SELECT MD5(CONCAT_WS('/',
            employee_code, gender, name, DOB, salary, dept_code, loc_code))
        FROM employees
        ORDER BY employee_code;
    # handler declarations come last
    DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done = 1;

    OPEN c1;
    SET @globalCRC = REPEAT('0', 32);
    REPEAT
        FETCH c1 INTO tmpCRC;
        IF NOT done THEN
            SET @globalCRC = MD5(CONCAT(@globalCRC, tmpCRC));
        END IF;
    UNTIL done END REPEAT;
    CLOSE c1;
    # Now @globalCRC contains the range CRC for table employees
END;
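Once created, the procedure would be run on each server and the
resulting values compared in the client:

CALL rangeCRC();
SELECT @globalCRC;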
MySQL 5.0.0 is still experimental, and therefore I can't recommend
this method as of today. However, when the 5.x line reaches a stable
release, I can foresee its use as a way to ease the path toward a
distributed database system.
Planning for a Distributed Database Server
If you have gone through all the above steps, you will have realized
that remote comparison is not rocket science. The complication arises
from the fact that the engine is not equipped for such a task.
An automatic CRC signature for each record (MySQL already has one)
and an appropriate function to access its value would make the shadow
table redundant. A built-in function to calculate range CRCs
would speed up the process even further, besides making it more
accurate.
Such a feature would need to be organized and made compatible with
the rest of the existing engine, perhaps with the CRC made optional
to reduce the overhead when it is not needed, but it is not
impractical. A real distributed DBMS could really be just around
the corner.
Notes
Thanks to Vahe Sarkissian and Alan Ferrency for commenting on an
early draft of this article and for their valuable advice.
To compare the structure of two databases, you can use a tool
developed by Adam Spiers -- http://search.cpan.org/author/ASPIERS/MySQL-Diff-0.33/
Giuseppe Maxia is a freelance database consultant and CTO of
an Italian company specializing in Open Source solutions (www.stardata.it).
He is widely known in the Perl and MySQL community through his articles
and software contributions.