
Taming the Distributed Database Problem: A Case Study Using MySQL

Giuseppe Maxia

There are situations where data belonging to a single context are nonetheless scattered among different servers, possibly in separate locations. In such situations, gathering the data together is more difficult than dealing with a unified set of tables that all live under the same engine. The challenge is purely technological. It is easy to create a virtual file system that makes the database engine believe all the files are on the same machine, even though they are reached over a network link, perhaps hundreds of miles long. In such a case, however, performance would be so poor that the whole purpose of having a DBMS would be defeated. Hence the field of distributed database research, which tries to find clever ways of closing the technological gaps and making normal operations look, if not like ordinary business, at least affordable.

Distributed database theoreticians often debate the feasibility of performing JOIN operations across distant databases. There are some astute methods for achieving such goals, sometimes with acceptable performance. A JOIN between two tables in different databases, with a WHERE clause involving both systems, can be carried out quite quickly. But things are not always that easy. Some classes of queries require so much interaction between the tables that their cost as remote calls becomes prohibitive. As an example, consider a query that finds which rows differ between two given tables, or which rows are missing from one of them. Performing such a task in a distributed database can be a challenging exercise. In this article, I will examine exactly this kind of query and explore some alternative ways of doing remote comparison.

Remote Comparison

Comparing tables that belong to different databases implies examining the entities at three levels:

  • Table structure
  • Amount of data
  • Data contents

The importance of the first two levels should be straightforward. It is useless to compare data if we are not sure about the containers (i.e., the tables). If the tables have a different structure, any difference I may find will be meaningless. The same goes for the quantity of data available. If I use some statistical methods to compare the data in two tables, they must have the same number of records for the results to be of any use.

Fortunately, the first two levels are also the easiest to deal with. Comparing two table structures with the features that any DBMS offers is fast and accurate. There are already specialized tools that perform such tasks efficiently, and the row counting part is trivial.
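
As a minimal sketch (table names here are placeholders), the first two checks can be as simple as running the same statements on both servers and comparing the output in the client:

# structure: the output should be identical on both servers
SHOW CREATE TABLE mytable;

# amount of data: the counts must match before any content comparison makes sense
SELECT COUNT(*) FROM mytable;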

Thus, I am left with the third level -- the one that troubles me most. Before getting into more detail, let me first explain how I got into this perilous field.

Background

Until a few months ago, I was in charge of developing and maintaining a large human resources database. It was a work in progress, being migrated from a legacy system with new features being developed constantly, so I found it useful to install two servers -- one for development, and one for production (Figure 1).

My first encounter with remote comparison was when the new features in the development server needed some adjustments to their structure. Before committing the changes in the production server, I tested the new features in the development one, and I kept a log of the changes, which I then replayed to the main server. There was no problem with this strategy until the day I missed an item in the log file and thus didn't update the main server correctly.

A few weeks later, with the whole issue forgotten, it took me several hours to track down the resulting bug. In the process, I found plenty of tools that compare the structure of two databases and even automatically produce the ALTER TABLE statements needed to fill the gaps.

But the problems were not that simple. Besides the development/production duality, I also had a data warehouse to facilitate heavy number crunching for the statistics that the management required on a daily basis. Because the data warehouse needed development, migration, and maintenance, it was also duplicated. This was no big deal at the beginning, but data warehouses naturally grow large very quickly. So I had to devise a mechanism to populate each of them within their respective servers.

Everything went well and the data warehouse produced statistics that never got more than a distracted glance, until one day, 30 minutes before a big meeting, somebody realized that there was a .02% inconsistency in the statistics compared to the previous month. So, of course, I was tasked with fixing it.

I compared the numbers in the development server with the main one and found that the numbers in the test server were correct. I printed the statistics with 19 minutes to spare, but after everyone had left I was still there trying to find out the difference between two huge tables in two different servers. Once all the users had logged out, I copied the tables from the development to the production server, which kept the servers busy for the better part of 3 hours, and then compared the two tables locally and found the culprit.

This incident led me to more serious thinking about remote comparison, and here are the results. Remote comparison is also useful for purposes such as establishing a mirror database or checking the correctness of a replication engine. You never know when such a need will arise -- unless, of course, you are developing a distributed database system -- and therefore these notes may turn out to be useful.

Brute Force Attacks

I used a brute force approach on my data warehouse (Figure 2a), which is not recommended, because it transfers all the records from one server to another. So, if your table has one million records, you move all of them across the network. Additionally, the receiving server might have to recreate the tables from raw data, depending on the method you choose to do the transfer.

Another common approach, though, may be even worse. When this problem pops up in a programming forum, usually somebody suggests querying both servers and comparing the data sets in the client (Figure 2b). What they don't realize is that they are moving two million records across the network, making two unhappy DBAs and an unhappy sys admin in the bargain.

Comparing Other Things

There is something to learn from the methods used to compare things other than database tables. For instance, let's see how to compare two binary files.

Unless you want to pass the file data from one computer to another, there is a commonly used procedure to make sure that a file on your machine is the same as one on a distant server. It is used, for example, when you download a large file, such as a CD ISO image, from a Web site. Besides the file, the site may provide the MD5 signature of the file (a 128-bit number), usually presented as a 32-character hexadecimal string. Once you have downloaded the file, you can calculate its MD5 locally and compare the result with the one on the site. If they match, you know that the download was successful. The benefit of this method is that you compare a small amount of data that represents the real thing instead of sending large chunks of bytes across a network to perform the comparison.

A similar approach is feasible to compare one single row in a database table. You can either send all the row fields and then compare each one, or you can calculate some sort of CRC for the row data and compare just that. Once more, the MD5 function is a preferred choice for the task:

SELECT MD5(CONCAT(field1, field2, field3, field4)) as signature from MYTABLE;
This query will return a 32-character string -- quite convenient if the fields in the table include some large BLOB. The comparison is then trivial for any client.
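
One caveat is worth noting here, and it is presumably one reason the later examples use CONCAT_WS with a separator: plain CONCAT returns NULL as soon as any field is NULL, and it cannot distinguish, say, ('ab','c') from ('a','bc'). A slightly more robust per-row signature (field names are again placeholders) might be:

SELECT MD5(CONCAT_WS('/', field1, field2, field3, field4)) AS signature
FROM MYTABLE;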

However, even this knowledge does not lead to a definite solution. How can I calculate a CRC for a whole table instead of a single row? There is no ready answer, except, perhaps, to apply the file CRC method and compare the two database files directly. That would be a mistake, though. Even if all DBMS engines used a single file for each table -- and that is not true -- there is no guarantee that the same data is stored in the same way. Too many factors determine the file format, from the file system to the page size to the order in which the records were inserted, and so on. A binary comparison of table files would be unreliable.

Determining Whether Two Tables Are Different

The CRC approach, while misleading when applied to files, can still be used for remote comparison, provided that you think outside the box. Here's an example. You want to know whether two tables differ at all. This information is useful when you only need to know whether a copying mechanism is working properly, such as when monitoring a replication server. You don't need to know what is different; the mere fact that there is a divergence should ring a bell for the DBA. So, here goes:

SELECT SUM(
   CONV(
     SUBSTRING(
       @CRC:=MD5(CONCAT_WS('/',field1, field2, field3 ...)),
     1, 8),
   16, 10)) AS sig1,
   SUM(CONV(SUBSTRING(@CRC,  9, 8), 16, 10)) AS sig2,
   SUM(CONV(SUBSTRING(@CRC, 17, 8), 16, 10)) AS sig3,
   SUM(CONV(SUBSTRING(@CRC, 25, 8), 16, 10)) AS sig4
FROM
   mytable;
What I get from this query is a set of four numbers, which together represent the whole table, just as a CRC can represent one row. Let's walk through the innards of this statement, which will come in handy very soon. Since the functions are nested, we'll start from the innermost ones and work outwards.

CONCAT_WS is a function that merges all its arguments, separated by the string given as its first argument. To the resulting string, I apply the MD5 function, which gives me a string of 32 characters. That string is then assigned to the user variable @CRC. In MySQL, a user variable is something you can assign a value to and use later in the same query, or in a different one within the same session. One interesting feature of user variables is that the assignment is evaluated only once per row. Thus, I assign the result of MD5 to @CRC to avoid recalculating it for each substring.

Next comes the SUBSTRING function. I extract the first eight characters from @CRC and pass them to the CONV function, which converts the string from base 16 to base 10. Why? I use CONV because I need to perform an arithmetic operation (SUM) on the MD5 result, and I can't do that directly because the MD5 result is a string, not a number. So, why didn't I pass @CRC directly to CONV? Why did I have to split @CRC into four pieces? There are two reasons. The first is that CONV works with 64-bit precision, so the 128-bit output of MD5 would be truncated. The second is that, even if I could convert the whole number, the SUM of such values would have a fair chance of overflowing.

Once I work around these limitations, I get four fields, each one a portion of the MD5 calculation summed over the entire table. If this query produces the same results on both databases, then the two tables have the same contents. A mismatch in any of the fields indicates that at least one field in one row has a different value. A handy simplification is to enclose the four sums in a CONCAT function, so that you get a single result at the price of a little additional CPU load on the server. Listing 1 shows an example Perl script to compare two remote tables.
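
As a minimal sketch of that simplification (placeholder field names again), the four sums collapse into one string that the client can compare with a single equality test:

SELECT CONCAT(
   SUM(CONV(SUBSTRING(@CRC:=MD5(CONCAT_WS('/',field1, field2, field3, field4)), 1, 8), 16, 10)),
   SUM(CONV(SUBSTRING(@CRC,  9, 8), 16, 10)),
   SUM(CONV(SUBSTRING(@CRC, 17, 8), 16, 10)),
   SUM(CONV(SUBSTRING(@CRC, 25, 8), 16, 10))) AS table_signature
FROM mytable;
# as in the query above, this relies on @CRC being assigned
# before the other substrings are extracted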

A Clever but Unfocused Solution

Now I have all the elements needed to reach the solution, and I would like to show you how I eventually got there. The path involved some mistakes, as usual. So, let me walk you through an incorrect solution that nonetheless has some very useful aspects.

I know that I can get a CRC for one row, and that I can sum over an entire table and get a global CRC. But how can I apply this knowledge to find which rows are actually different? I know that the two servers can't interact in this respect without exchanging an unacceptably large amount of data, so I must apply some sort of divide-and-conquer algorithm, refining the method to get the CRC of a range of rows.

This should not be too hard. It should be enough to use the query seen above, filtered with a WHERE clause. Shouldn't it? Theoretically, yes. In practice it is doable, and it would be convenient if I had to run the query only once. But for any useful purpose, the query must narrow down the range (or ranges) where the differences lie; therefore, it must run several times, with a decreasing range, until the goal is achieved.
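
As a minimal sketch (placeholder field names, and assuming a numeric primary key column named PK), the range-restricted version simply adds the boundary condition:

SELECT SUM(CONV(SUBSTRING(@CRC:=MD5(CONCAT_WS('/',field1, field2, field3, field4)), 1, 8), 16, 10)) AS sig1,
       SUM(CONV(SUBSTRING(@CRC,  9, 8), 16, 10)) AS sig2,
       SUM(CONV(SUBSTRING(@CRC, 17, 8), 16, 10)) AS sig3,
       SUM(CONV(SUBSTRING(@CRC, 25, 8), 16, 10)) AS sig4
FROM mytable
WHERE PK BETWEEN 1 AND 512;   # the range currently under examination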

To reduce the overhead of calculating a heavy CRC for each row, I create a shadow table with only the minimum information needed to drill down the search. Given a source table "employees", with the definition in Figure 3, the shadow table would have the following structure:

CREATE TABLE shadow_table (
    signature char(32) not null primary key,
    sig1 BIGINT not null,
    sig2 BIGINT not null,
    sig3 BIGINT not null,
    sig4 BIGINT not null,
    PK int not null,
    unique key PK(PK));
And I would fill the table with one query:

INSERT INTO shadow_table
SELECT
  @CRC:=MD5(
  CONCAT_WS('/',employee_code,gender,name,DOB,salary,dept_code,loc_code)),
  CONV(SUBSTRING(@CRC, 1,8),16,10),
  CONV(SUBSTRING(@CRC, 9,8),16,10),
  CONV(SUBSTRING(@CRC,17,8),16,10),
  CONV(SUBSTRING(@CRC,25,8),16,10),
  employee_code
FROM employees;
The significance of this query should be clear from the previous explanation.

Now I need a sensible algorithm to determine which rows are different. After some thinking, I decided to use a binary search, an established technique with a well-deserved reputation for efficiency.

Here is the outline. To begin, I test the whole table to see whether there is any difference at all. With the shadow table, the query is much nicer and faster:

SELECT
  SUM(sig1), SUM(sig2), SUM(sig3), SUM(sig4)
FROM
  shadow_table;
As for the binary search itself, its implementation is a little trickier than a normal search, due to the possibility that there is more than one difference between the two tables. Assuming that the tables have about one thousand rows each, with two differing records at IDs 750 and 900, the search goes like this:

1. Check the CRC of the ranges 1..512 and 513..1024 (these ranges work better if they are powers of two):

SELECT
  SUM(sig1), SUM(sig2), SUM(sig3), SUM(sig4)
FROM
  shadow_table
WHERE
  PK BETWEEN 1 AND 512;
From this, I get the signatures of the two ranges from each server, and I compare them in the client.

2. For the first range, I have a match, meaning that there is no difference in the first 512 rows. The values for the second range differ, so I split that range again and perform another search.

3. In the next comparison, I get a mismatch in both ranges (513-768, 769-1024), meaning that both have at least one difference. I keep both ranges and split further. Here the algorithm starts to differ from a common binary search, where I retain only one range.

4. After splitting again and again, I narrow down the ranges until eventually I am left with single-row ranges that point to the differing rows (Figure 4); the queries for the successive splits are sketched below.
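
To make the drill-down concrete, here is a hypothetical sequence of the range queries the client would send to both servers for the example above (only the WHERE clause changes from step to step):

# First split: check both halves on each server
SELECT SUM(sig1), SUM(sig2), SUM(sig3), SUM(sig4)
FROM shadow_table WHERE PK BETWEEN 1 AND 512;

SELECT SUM(sig1), SUM(sig2), SUM(sig3), SUM(sig4)
FROM shadow_table WHERE PK BETWEEN 513 AND 1024;

# Only the second half mismatches, so it is split again:
#   PK BETWEEN 513 AND 768    -> mismatch (row 750 lies here)
#   PK BETWEEN 769 AND 1024   -> mismatch (row 900 lies here)
# Both sub-ranges are kept and split further:
#   513-640, 641-768, 769-896, 897-1024, ... down to single rows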

It took ten queries per server to get the result, and I was transferring only tiny amounts of data across the network. For a table of one million records, finding a difference would require about twenty queries. It seemed that I had struck gold, but unfortunately it does not always work like that.

There are best cases, such as one difference in one million, and there are worst cases, say one difference every one hundred records. In the worst cases, this search performs so badly that you may wish you had used brute force instead.

What then? The principle seemed right, but the implementation was poor. So I resorted to Lasker's principle. Emanuel Lasker was a chess champion about a century ago, and he left a lasting impression on chess philosophy. Among other, more technical teachings, he left an aphorism that I try to apply to several fields, not just chess: "If you see the good move, look for the better one." This idea made me think harder about a better solution.

Refactoring

I was traveling by train, preparing a speech for the MySQL Users' Conference. Train trips can work wonders for your thoughts, if you give yourself over to them completely, without the distraction of driving. So there I was, thinking about database performance to the rattling lullaby of the train, when the idea came to me. I admitted to myself that my previous reasoning was faulty: I was doing with the server what I would have done with a client, instead of exploiting the intrinsic power of the database engine.

After this process of self-flagellation, I analyzed the algorithm again and found the weak points. Besides trying to mimic typical client behavior, the binary search with CRCs has the disadvantage that the same record is taken into account several times, leading to a large amount of unnecessary calculation. For instance, record number 750 was processed first in the range 1-1024, then again in the range 513-1024, then in the range 513-768, and so on. This just did not sound right. I was wasting resources and needed a different approach to minimize the calculation and make the task more efficient.

So, here is the new idea. I calculate the ranges in an arbitrary way, just once, and then leave the search within each range to the database engine. To do this, I must modify the structure of the shadow table: I remove the sigX fields and instead add a column for grouping purposes:

CREATE TABLE shadow_table
  (signature CHAR(32) NOT NULL PRIMARY KEY,
  PK BIGINT NOT NULL,
  span BIGINT UNSIGNED NOT NULL,
  UNIQUE KEY PK(PK),
  KEY span (span));
The span column will hold a value (derived from the primary key) that the main query will use for grouping the range CRCs. Filling the table is faster than before:

INSERT INTO
   shadow_table (signature, PK, span)
SELECT
   MD5(CONCAT_WS('/',
   employee_code,gender,name,DOB,salary,dept_code, loc_code)),
   employee_code, employee_code / 1024
FROM
   employees;
The filling query only calculates the CRC for each row. The numbers to pass to SUM are generated on the fly in the main query:

SELECT
    span, COUNT(*) as cnt,
    CONCAT(SUM(CONV(SUBSTRING(signature,1,8),16,10)),
    SUM(CONV(SUBSTRING(signature,9,8),16,10)),
    SUM(CONV(SUBSTRING(signature,17,8),16,10)),
    SUM(CONV(SUBSTRING(signature,25,8),16,10))) as sig
FROM
    shadow_table
GROUP BY
    span;
See Listing 2.

Compared to the binary search, the first query returns many more rows than before. For one million records, I get about one thousand rows from each server, and then I compare the result sets in the client. But I don't have to do any more calculations.

In the client, for each range that does not match, I run two more queries. The query to the first server asks for all the signatures in that given range. To the second server, I send a special query asking it to return all the rows that don't match the signatures from the first server (Figure 5):

# first server
SELECT signature FROM shadow_table WHERE span = XX;
# I get signatures "A2F", "B4E", "C1D", etc.

# second server
SELECT
    gender, name, DOB, salary
FROM
    shadow_table
    INNER JOIN employees ON (PK=employee_code)
WHERE
    span = XX
    AND signature NOT IN ("A2F", "B4E", "C1D", ... );
The second query would be a nightmare to create by hand, but fortunately it is a breeze to build with some handy Perl idioms:

# $sth1 is the statement handle holding the signatures fetched from the first server
my $columns = $sth1->fetchall_arrayref();
# build the NOT IN query for the span currently being checked ($_)
my $query = qq{SELECT PK FROM check_table WHERE span = $_ AND
signature NOT IN (}
  . (join(",", map {"'$_->[0]'"} @$columns ))
  . ")";
See Listing 3.

The whole process is much less painful than the binary search. When I first tried a comparison with this method, it was so much faster that I thought I had blundered somewhere, or that perhaps my solution was just an illusion. But a few robust tests confirmed that not only does this method work well, it also runs two to five times faster than the binary search.

Limits

This algorithm does not cover every possible case. For educational purposes, and because of space limitations, I applied it only to tables with a numerical, non-composite primary key. The algorithm can be extended to other cases as well, but explaining them would require much more paper to print. Also, this algorithm does not catch the case where a record has been deleted from one of the tables. Fortunately, that side problem is easily handled by the method shown in the next section.

Finding Missing Records

Before venturing into the table comparison, it would be good practice to check whether the tables have the same number of records and, if they don't, try to find which ones are missing. Finding missing records is a less demanding task than finding the differences, because the operation at stake is just counting -- this feature is built into the database engine and is performed quite efficiently.

The system works much like the one used for differences. On each server, run a query that counts the records in each range; then compare the lists in the client and build a further query to find the individual missing rows. Because I am counting records (instead of comparing CRCs), the queries must follow a specific order. When comparing the two sets, I take the side with the lower count, fetch its primary keys for that range, and then ask the other server to return the records in that range whose primary keys are not in the list (Listing 4).
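
As a minimal sketch, assuming the shadow table built earlier and a span value XX for which the counts disagree, the exchange could look like this (the keys in the NOT IN list are those fetched from the server with the lower count):

# on both servers: count the records in each range
SELECT span, COUNT(*) AS cnt
FROM shadow_table
GROUP BY span;

# on the server with the lower count for span XX: list its primary keys
SELECT PK FROM shadow_table WHERE span = XX;

# on the other server: the missing records are those whose keys
# did not appear in that list
SELECT PK FROM shadow_table
WHERE span = XX
  AND PK NOT IN (1001, 1005, 1099);   # keys returned by the previous query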

By combining the difference and the counting methods, I should be able to catch any difference between two remote tables. There are two caveats:

1. Choosing an unsound grouping column or expression may lead to incorrect results when both tables have deleted records. The best candidate for this kind of grouping is a single-column numerical primary key divided by a fixed number. My favorite is FLOOR(PKcolumn / 1024).

2. Making the ranges too big may result in a query larger than the maximum packet size that MySQL allows. For example, with a range of 100,000 rows, the query sent to the second server will be approximately 600 KB for counting and 1.8 MB for differences (calculate the size of one signature, plus commas and quotes). Because the default packet size is 1 MB, your query may be refused.

Some Speculation on CRC Effectiveness

A CRC condenses information into a compact, standard representation. For the purpose of this research, the CRC seems to be up to the task. By representing a whole record with a CRC string, I reduce the amount of information that needs to be exchanged between the two servers. There is a conceptual downside, though. Although a record is a reasonably small amount of data, a range of 1000 records is a fairly large chunk of bits. When I was squeezing CRCs into a simple sum, I had the feeling that I was reducing my ability to describe the data properly. If I had a proper CRC function that I could apply to a range of records, I would not be too concerned. After some brainstorming with a few IT enthusiasts, I concluded that there is at least a theoretical possibility of a collision when a sum replaces a proper CRC.

The reasoning goes like this. I have N numbers that sum up to a given result. It is possible to find another set of N numbers that sums up to the same result. For example, given the total 9, I can reach it with 5+4, 6+3, 7+2, 8+1, or 9+0. Transferring the reasoning to the kind of numbers used in this article, how often can I expect such a collision? This thought bugged me for a while, so I calculated the probability.

Using 128-bit numbers, the probability of having a collision (i.e., at least two numbers that sum up to the same result) is .000000000000000000000000000000000000002938 (2.93e-39).

By comparison, the probability of winning first prize in the U.S. lottery is .00000001 (1.0e-08). This string of zeros considerably reassured me about the unlikelihood of a collision. Therefore, for the moment, I assume that the method is reasonably safe.

A Viable Alternative Using Stored Procedures

Using the freshly released stored procedures available in MySQL version 5.0.0, an alternative solution would be to calculate a more robust CRC with a cursor:

CREATE PROCEDURE rangeCRC()
BEGIN
    # variable declarations must come first, then cursors, then handlers
    DECLARE done INT DEFAULT 0;
    DECLARE tmpCRC CHAR(32);
    DECLARE c1 CURSOR FOR
        SELECT MD5(CONCAT_WS('/',
        employee_code,gender,name,DOB,salary,dept_code,loc_code)) FROM employees;
    DECLARE CONTINUE HANDLER FOR SQLSTATE '02000'
        SET done = 1;

    OPEN c1;
    SET @globalCRC = REPEAT('0',32);

    REPEAT
        FETCH c1 INTO tmpCRC;
        IF NOT done THEN
            # fold each row's CRC into the running global CRC
            SET @globalCRC = MD5(CONCAT(@globalCRC, tmpCRC));
        END IF;
    UNTIL done
    END REPEAT;
    CLOSE c1;
    # Now @globalCRC contains the range CRC for table employees
END;
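Assuming the procedure compiles on a 5.0 server, a hypothetical usage sketch, to be run on each server and compared in the client, might be:

CALL rangeCRC();
SELECT @globalCRC;   # compare this 32-character string between the two servers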
MySQL 5.0.0 is still experimental, and therefore I can't recommend this method as of today. However, when 5.x reaches a stable release, I can foresee its use as a way of easing the path toward a distributed database system.

Planning for a Distributed Database Server

If you have gone through all the above steps, you will have realized that remote comparison is not rocket science. The complication arises because the engine is not equipped for such a task.

An automatic CRC signature for each record (MySQL has it already) and an appropriate function to access its value would make the shadow table redundant. A built-in function to calculate range CRCs would speed up the process even further, besides making it more accurate.

This mechanism would need to be organized and made compatible with the rest of the existing engine, perhaps by making the CRC optional to reduce the overhead when the feature is not needed, but it is not impractical. A real distributed DBMS could be just around the corner.

Notes

Thanks to Vahe Sarkissian and Alan Ferrency for commenting on an early draft of this document and for their valuable advice.

To compare the structure of two databases, you can use a tool developed by Adam Spiers -- http://search.cpan.org/author/ASPIERS/MySQL-Diff-0.33/

Giuseppe Maxia is a freelance database consultant and CTO of an Italian company specializing in Open Source solutions (www.stardata.it). He is widely known in the Perl and MySQL community through his articles and software contributions.