Cover V14, i07


Effective Database Key Generation Techniques

Alexander Daminoff

Unique entity identity is one of the fundamental principles of data modeling. When devising a database of just about any type, the ability to uniquely identify every record is an essential attribute of successful design. In fact, relational database theory mandates that every relation have a primary key -- an attribute or collection of attributes that identifies it uniquely and unambiguously [1]. Besides serving as a unique ID for a given entity, a primary key is also used to establish and maintain inter-entity relationships. Thus, a human resources database is likely to relate employees to their respective departments by tagging every employee record with a department key.

Choosing Primary Keys

What makes a good primary key? Obviously, an attribute or a collection of attributes that uniquely identifies a record is a good candidate for a key. However, besides uniqueness, a primary key must satisfy other constraints -- it must be mandatory and immutable. Consider the choice of a social security number, for instance, as a primary key for an employee database. At first, it seems to be a good candidate because of its uniqueness, but is it really mandatory? While it certainly is for U.S. employees, international companies would have a problem with such a database design, simply because many countries overseas do not use similar identification schemes. Immutability is another important factor, as illustrated by the following example.

Is department name a good candidate key for a department relation? Perhaps -- because it's unique within a given organization and is indisputably mandatory. However, it is volatile and likely to change due to organizational restructuring or other activities that routinely take place within any organization. A change of a primary key is disastrous; it necessitates changes not only to its immediate relation but also to all other relations that use the key as a reference. Thus, a change of a department name would affect all employees associated with this department via the department name key.

Properties or attributes of an entity or an object that uniquely and unambiguously identify it within a certain scope are commonly referred to as natural keys. Natural keys are often the preferred choice of unique identity when designing a database because they are easy to use, understand, and memorize. Natural keys are also inexpensive and don't require a time- and resource-consuming key generation process. Natural keys do, however, have limitations. Despite the often abundant choice of candidate keys, a good primary key is not always available.

Let's return to the employee database example. What would make a good primary key for an employee relation? We already dismissed the choice of social security number -- at least for international organizations. What about the combination of first and last names? Still not quite satisfactory -- the names are not unique and are prone to changes. So, how do database designers solve the problem of not having a reliable natural key to ensure unique entity identity? They use surrogate keys. A surrogate key is an artificially created unique identifier that usually has no business meaning; its sole purpose is to guarantee a unique and unambiguous way to identify an entity. Thus, many organizations, such as the one I'm currently employed by, use artificially created employee IDs as the primary means to unambiguously identify employees.

Globally Unique Identifiers

Perhaps the most popular approach to resolving the entity identity crisis is using so-called Globally Unique Identifiers (GUIDs; also known as Universally Unique Identifiers or UUIDs). A GUID is 128 bits long and is guaranteed to be different from all other GUIDs generated until 3400 A.D. GUIDs were originally used in the Network Computing System (NCS) and later in the Open Software Foundation's (OSF) Distributed Computing Environment. Currently, many different technologies rely on GUIDs to provide unique identity for various software components. Microsoft COM/DCOM, for example, uses GUIDs very extensively to uniquely identify classes, applications, and components across network-connected systems.

While there are a few different algorithms for generating GUIDs [2], they all satisfy the same requirements -- they guarantee GUID uniqueness in time and space and provide reasonably efficient GUID allocation rates. Most algorithms allow for producing as many as 10 million GUIDs per second per machine, which makes them suitable for identifying both extremely short-lived and very persistent objects on a given system as well as across the network.
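The two most common flavors of UUID -- time/space-based and random -- can be seen with Python's standard uuid module; this module is not mentioned in the article and is used here purely as a convenient, portable illustration:

```python
import uuid

# A version-1 UUID combines a 60-bit timestamp, a clock sequence, and a
# node identifier (typically the MAC address) -- the time/space scheme
# described above.
u1 = uuid.uuid1()

# A version-4 UUID is built from random bits instead.
u4 = uuid.uuid4()

# Every UUID is 128 bits long, regardless of version.
assert u1.int.bit_length() <= 128
assert u4.int.bit_length() <= 128

print(u1)   # e.g. a string like "f81d4fae-7dec-11d0-a765-00a0c91e6bf6"
print(u4)
```

Note that only version-1 UUIDs carry the time/space guarantee discussed in the text; version-4 UUIDs rely on the sheer size of the random space to make collisions negligibly unlikely.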

The benefits of GUID-based identity schemes are apparent -- GUIDs are guaranteed to be unique, which greatly reduces any chance of collisions; GUIDs are also relatively easy and cheap to generate. The problem is that GUIDs are intended for machine consumption and are not particularly human-friendly; they are quite meaningless and difficult to memorize. Nevertheless, the benefits of GUIDs greatly outweigh the minor inconveniences commonly associated with their use in database modeling and design, and nowadays most database systems and operating systems offer some sort of functionality for generating GUIDs or GUID-like surrogate keys.

Naive Approach to Key Generation

When tasked with designing a surrogate-key-based relation, some less-experienced data modelers resort to using sequence numbers as primary keys. In fact, most modern databases provide automatic sequence number generation functionality, such as Microsoft Access's AutoNumber and Sybase's IDENTITY. At first, automatic sequence number generation seems very tempting -- the task of assigning a unique numeric identifier to each entity is relegated to the database and takes place automatically as data rows are loaded into a table with an automatic sequence number column. However, this database design choice often creates nightmares for the database maintainers.

The most widespread problem with automatic sequence numbers is referred to as "identity gaps". Sybase ASE, for instance, is known to suffer from the identity gap problem -- large, sudden, and unexpected jumps of values in a key column tagged with IDENTITY property. Such identity gaps are usually a result of ungraceful shutdowns or system crashes, but they can also be caused by certain backup/restore scenarios. Although there are numerous recipes for avoiding and fixing identity gaps, they make the IDENTITY functionality troublesome at best.

Disappointed with automatic sequence number generation, many database designers adopt the "do-it-yourself" approach. An obvious solution is to simply designate an integer column of a table in question as the primary key and then keep track of the last-assigned sequence number via some sort of counter table. Consider the following SQL code:

-- counter table that tracks the last assigned key
create table key_counter (
   last_key int not null
)
insert into key_counter values (0)

-- generate the next key: read the counter, then increment it
declare @key int
select  @key = last_key + 1 from key_counter
update  key_counter set last_key = last_key + 1
Although it looks simple, this code creates a whole slew of problems. Consider a scenario where two tasks (A and B) are attempting to generate the next sequence number simultaneously. Task A reads the initial value from the counter table into its @key variable and ends up with a value of one. At the same time, task B also reads the counter value from the same table and, to nobody's surprise, ends up with the same value of one simply because task A hasn't updated the counter table yet. The result is a duplicate key value -- exactly what we were trying to avoid in the first place. Attempting to fix the problem by wrapping the key generation logic within a database transaction only makes things worse:
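The lost-update interleaving can be modeled in a few lines of Python; the dict below is a stand-in for the key_counter table (the names are illustrative only -- no real database is involved):

```python
# A minimal in-memory model of the race: both tasks perform the unguarded
# "read, add one, write back" sequence, and their steps interleave exactly
# as in the scenario above.

counter = {"last_key": 0}          # stands in for the key_counter table

# Step 1: task A reads the counter and computes its key.
key_a = counter["last_key"] + 1    # key_a == 1

# Step 2: task B reads the counter before A has written back.
key_b = counter["last_key"] + 1    # key_b == 1 -- the same value!

# Steps 3 and 4: both tasks now perform their updates.
counter["last_key"] += 1           # task A's update
counter["last_key"] += 1           # task B's update

print(key_a, key_b)                # both tasks got key 1 -- a duplicate
```

The counter ends up at 2, so the table looks healthy afterwards -- which is precisely what makes this class of bug so hard to spot in production.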

declare @key int
begin transaction
select  @key = last_key+1 from key_counter holdlock
update  key_counter set last_key = last_key + 1
commit transaction
First, the fact that the shared read lock here is held until the end of the transaction does not prevent the other task from acquiring yet another shared lock, so the problem of key duplication still exists. Second, a shared lock will not be promoted to an exclusive lock unless the competing task releases its shared lock. Because the shared locks will not be released until the transactions complete, a deadlock will result. The right solution is to apply the exclusive lock at the beginning of the transaction so that task A can proceed to completion while task B is blocked:

declare @key int
begin transaction
update  key_counter set last_key = last_key + 1
select  @key = last_key  from key_counter
commit transaction
Although the above code will work and produce the expected results, its main drawback is the necessity to update the key counter table for every generated key. Even though modern databases are capable of carrying out the update operation in fractions of a millisecond, in a multitasking environment such a design will create a hot spot: numerous tasks will block while waiting their turn to update the key counter table, greatly affecting the performance of the key generation process.

Better Sequence Number Allocation

As I stated previously, the main problem with using key counter tables for sequence number generation is the cost of the disk access incurred for every key generated. Thus, the obvious way to make this algorithm more efficient is to devise an approach that minimizes the required number of I/O operations and therefore reduces the contention over the key counter table. A common technique is to pre-allocate intervals of numbers to all processes attempting to generate sequential keys. Consider the following SQL and Perl code:

create table key_counter (
   run_no int not null,
   seq_no int not null
)
insert into key_counter values (0,0)
...
...

use DBI;
use Carp;

$dsn = "dbi:ODBC:DSN=MY_SQL_SERVER;database=mydatabase";
$conn = DBI->connect($dsn, "user", "password");
croak "Can't connect to $dsn...\n" unless $conn;

$seq_length = 4;
$key_ref    = undef;

my $key = nextid($conn);
print "generated $key...\n";
exit(0);

sub load_counter {
   my $conn = shift;
   $conn->begin_work();
   $conn->do("update key_counter set run_no = run_no + 1");
   # refresh the global $key_ref -- note: no "my" here, or nextid()
   # would never see the newly loaded counter values
   $key_ref = $conn->selectrow_hashref("select * from key_counter");
   $conn->commit();
   $key_ref->{seq_no} = 0;
}

sub nextid {
   my $conn = shift;
   load_counter($conn) unless defined($key_ref) and
      $key_ref->{seq_no} < (2**$seq_length - 1);
   return sprintf("0x%08X",
      ($key_ref->{run_no} << $seq_length) | $key_ref->{seq_no}++);
}
Here the key_counter table has two fields rather than one -- what I call the run ID and the sequence number. The idea is rather simple -- the run ID denotes the upper portion of the key, i.e., the n most significant bits, while the sequence number represents the least significant bits of the same key. The table is updated only when it is necessary to increment the run ID; the sequence number is incremented in memory without incurring additional I/O.

The number of bits of the key space allocated to the run ID and the sequence number, respectively, is configurable; it is controlled by the $seq_length variable. Thus, if $seq_length is set to 4, the sequence number will occupy the 4 least significant bits of the resulting integer key, while the run ID will represent the remaining 28 bits of the key space. This will allow for generating 15 sequential keys with a single database access. Setting the $seq_length to 8 will yield 255 sequential keys at the expense of a single database access and so on. In theory, a given process that attempts to generate sequential keys may have some prior knowledge of the required number of keys. Thus, a data loading program can determine the number of records for which it must generate the keys and then can adjust the $seq_length control variable accordingly.

The bulk of the work here is done by the nextid() subroutine, which first checks whether the run ID has already been initialized ($key_ref is set) and whether the process has exhausted the number of keys allowed per single increment of the run ID (that is, whether the sequence number has reached the maximum that can be represented by $seq_length bits). If the sequence number is still within the limit, nextid() builds a key by combining the run ID, shifted left by $seq_length bits, with the current sequence number, which is then incremented. If the allowed sequence numbers have been exhausted, nextid() calls the load_counter() subroutine, which increments the run ID and resets the sequence number to zero.

This little algorithm, despite its simplicity, is very effective and allows for fast and virtually contention-free generation of sequential integer keys. Perhaps its most attractive feature is its ability to control the number and frequency of database I/O while minimizing the gaps in generated integer keys.

UUID-Based Keys

There are many situations where sequential integer keys are the most efficient solution to the problem. For example, when loading vast amounts of data into a table clustered on an integer key, the speed of the loading process will greatly improve if the generated keys are sequential rather than random. However, there are a few problems with sequence numbers that make them less desirable under certain circumstances.

First, sequence numbers are fairly expensive to generate, as it is always necessary to keep track of the last-generated number via some sort of persistent data structure, hence the I/O and unavoidable contention. Second, the integer key space is rather limited, and (even if you use 64-bit rather than 32-bit integers) it is always possible to simply "run out of numbers" when dealing with large amounts of data. Thus, whenever the sequential nature of the key is not important, UUID-based key generation schemes are far superior; they guarantee unsurpassable speed at low cost and provide for a virtually unlimited number of keys.
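To put the "run out of numbers" concern in perspective, here is a quick back-of-the-envelope calculation, assuming the 10-million-keys-per-second allocation rate quoted earlier (a deliberately extreme, sustained rate used only for illustration):

```python
# How long until each key space is exhausted at 10 million keys/second?

rate = 10_000_000                        # keys per second (assumed rate)

# A 32-bit integer key space is gone in about seven minutes.
secs_32 = 2 ** 32 // rate                # -> 429 seconds

# A 64-bit space lasts tens of thousands of years.
years_64 = 2 ** 64 // rate // (365 * 24 * 3600)

# The 128-bit UUID space is, for all practical purposes, inexhaustible.
years_128 = 2 ** 128 // rate // (365 * 24 * 3600)

print(secs_32, years_64, years_128)
```

No real workload sustains such a rate against a single counter, of course, but the comparison shows why the 128-bit UUID space is treated as effectively unlimited.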

There are numerous facilities for generating UUIDs, provided by databases, operating systems, and add-on software. Microsoft SQL Server, for instance, provides the NEWID() function, which creates a value of type "uniqueidentifier" -- a GUID. Microsoft Windows, as part of the COM/DCOM subsystem, offers a function called CoCreateGuid, which also generates GUID-style keys. However, for the sake of illustrating the cross-platform portability of the solution, rather than using a database- or OS-specific GUID generation algorithm, let's look at one of the publicly available Perl modules -- Data::UUID [3].

This module relies on a portable algorithm [2] and works on virtually any platform. It supports very high UUID allocation rates -- up to 10 million per second per machine -- and guarantees the uniqueness of the generated keys. Using the module is very easy:

use Data::UUID;

my $ug    = Data::UUID->new;
my $uuid1 = $ug->create();
print $ug->to_string($uuid1);
Once the UUID generator ($ug) is instantiated, its create() method can be used to generate an unlimited number of binary UUIDs. The module also supports numerous convenience functions, such as to_string(), which produces the string representation of a UUID.

Conclusion

Despite its seeming simplicity, key generation is a fundamental issue known to be the source of some of the most annoying and frustrating database problems. I hope that this article has shed some light on simple yet effective key generation techniques. If it helps someone to avoid one or two database nightmares, I will consider my mission accomplished.

References

1. Date, C. J. 1999. An Introduction to Database Systems, 7th ed. Addison Wesley Longman. ISBN: 0201385902.

2. Leach, Paul J. and Rich Salz. Internet Draft "UUIDs and GUIDs": http://www.globecom.net/ietf/draft/draft-leach-uuids-guids-01.html

3. Data::UUID -- Perl extension for generating Globally/Universally Unique Identifiers (GUIDs/UUIDs): http://search.cpan.org/~AGOLOMSH/Data-UUID/

Alexander Daminoff is currently employed as a Senior Technology Officer by JP Morgan Chase Inc, where he acts as a project manager supporting numerous large-scale control applications for the Legal and Compliance Department. He can be reached at: adaminoff@cronossystems.com.