Effective Database Key Generation Techniques
Alexander Daminoff
Unique entity identity is one of the fundamental principles of
data modeling. When devising a database of just about any type,
the ability to uniquely identify every record is an essential attribute
of successful design. In fact, relational database theory mandates
that every relation have a primary key -- an attribute or collection
of attributes that identifies each tuple uniquely and unambiguously [1].
Besides serving as a unique ID for a given entity, a primary key
is also used to establish and maintain inter-entity relationships.
Thus, a human resources database is likely to relate employees to
their respective departments by tagging every employee record with
a department key.
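In SQL terms, such a design might be sketched as follows (the table
and column names here are illustrative only):

create table department (
    dept_id    int         not null primary key,
    dept_name  varchar(60) not null
)

create table employee (
    emp_id     int         not null primary key,
    emp_name   varchar(60) not null,
    dept_id    int         not null references department (dept_id)
)

Every employee row carries the key of its department, and the foreign
key reference keeps the relationship consistent.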
Choosing Primary Keys
What makes a good primary key? Obviously, an attribute or a collection
of attributes that uniquely identifies a record is a good candidate
for a key. However, besides uniqueness, a primary key must satisfy
other constraints -- it must be mandatory and immutable. Consider
the choice of a social security number, for instance, as a primary
key for an employee database. At first, it seems to be a good candidate
because of its uniqueness, but is it really mandatory? While it
certainly is for U.S. employees, international companies would have
a problem with such a database design, simply because many countries
overseas do not use similar identification schemes. Immutability
is another important factor, as illustrated by the following example.
Is department name a good candidate key for a department relation?
Perhaps -- because it's unique within a given organization
and is indisputably mandatory. However, it is volatile and likely
to change due to organizational restructuring or other activities
that routinely take place within any organization. A change of a
primary key is disastrous; it necessitates changes not only to its
immediate relation but also to all other relations that use the
key as a reference. Thus, a change of a department name would affect
all employees associated with this department via the department
name key.
Properties or attributes of an entity or an object that uniquely
and unambiguously identify it within a certain scope are commonly
referred to as natural keys. Natural keys are often the preferred
choice of unique identity when designing a database because they
are easy to use, understand, and memorize. Natural keys are also
inexpensive, since they don't require a time- and resource-consuming
key generation process. However, natural keys do have limitations.
Despite the often abundant choice of candidate keys, a good primary
key is not always available.
Let's return to the employee database example. What would
make a good primary key for an employee relation? We already dismissed
the choice of social security number -- at least for international
organizations. What about the combination of first and last names?
Still not quite satisfactory -- the names are not unique and
are prone to changes. So, how do database designers solve the
problem of not having a reliable natural key to ensure unique
entity identity? They use surrogate keys. A surrogate key is an
artificially created unique identifier that usually has no business
meaning; its sole purpose is to provide a unique and unambiguous
way to identify an entity. Thus, many organizations, such as the
one I'm currently employed by, use artificially created employee
IDs as the primary means of unambiguously identifying employees.
Globally Unique Identifiers
Perhaps the most popular approach to resolving the entity identity
crisis is using so-called Globally Unique Identifiers (GUIDs; also
known as Universally Unique Identifiers or UUIDs). A GUID is 128
bits long and is guaranteed to be different from all other GUIDs
generated until 3400 A.D. GUIDs were originally used in the Network
Computing System (NCS) and later in the Open Software Foundation's
(OSF) Distributed Computing Environment. Currently, many different
technologies rely on GUIDs to provide unique identity for various
software components. Microsoft COM/DCOM, for example, uses GUIDs
very extensively to uniquely identify classes, applications, and
components across network-connected systems.
While there are a few different algorithms for generating GUIDs
[2], they all satisfy the same requirements -- they guarantee GUID
uniqueness across space and time, and provide reasonably efficient
GUID allocation rates. Thus, most algorithms allow for producing
as many as 10 million GUIDs per second per machine, which makes
them suitable for identifying both extremely short-lived and very
persistent objects on a given system as well as across the network.
The benefits of GUID-based identity schemes are apparent --
GUIDs are guaranteed to be unique, which greatly reduces any chances
of collisions; GUIDs are also relatively easy and cheap to generate.
The problem is that GUIDs are intended for machine consumption and
are not particularly human-friendly; they are quite meaningless
and difficult to memorize. Nevertheless, the benefits of GUIDs greatly
outweigh minor inconveniences commonly associated with their use
in database modeling and design, and nowadays most databases and
operating systems offer some sort of functionality for generating
GUIDs or GUID-like surrogate keys.
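For example, in Microsoft SQL Server a table can be declared so that
GUID keys are generated automatically (a sketch -- the table and
column names are illustrative):

create table employee (
    employee_id uniqueidentifier not null default newid() primary key,
    last_name   varchar(40)      not null,
    first_name  varchar(40)      not null
)

Rows inserted without an explicit employee_id receive a freshly
generated GUID by default.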
Naive Approach to Key Generation
When tasked with designing a surrogate-key-based relation, some
less-experienced data modelers resort to using sequence numbers
as primary keys. In fact, most modern databases provide automatic
sequence number generation functionality, such as Microsoft Access's
AutoNumber and Sybase's IDENTITY. At first, automatic sequence
number generation seems very tempting -- the task of assigning
a unique numeric identifier to each entity is relegated to the database
and takes place automatically as data rows are loaded into a table
with an automatic sequence number column. However, this database
design choice often creates nightmares for the database maintainers.
The most widespread problem with automatic sequence numbers is
referred to as "identity gaps". Sybase ASE, for instance,
is known to suffer from the identity gap problem -- large, sudden,
and unexpected jumps of values in a key column tagged with IDENTITY
property. Such identity gaps are usually a result of ungraceful
shutdowns or system crashes, but they can also be caused by certain
backup/restore scenarios. Although there are numerous recipes for
avoiding and fixing identity gaps, they make the IDENTITY functionality
troublesome at best.
Disappointed with automatic sequence number generation, many database
designers adopt the "do-it-yourself" approach. An obvious
solution is to simply designate an integer column of the table in
question as the primary key and then keep track of the last-assigned
sequence number via some sort of counter table. Consider the following
SQL code:
create table key_counter (
    last_key int not null
)

insert into key_counter values (0)

-- read the "next" key, then bump the counter
declare @key int
select @key = last_key + 1 from key_counter
update key_counter set last_key = last_key + 1
Although it looks simple, this code creates a whole slew of problems.
Consider a scenario where two tasks (A and B) are attempting to generate
the next sequence number simultaneously. Task A reads the initial
value from the counter table into its @key variable and ends up with
a value of one. At the same time, task B also reads the counter value
from the same table and, to nobody's surprise, ends up with the
same value of one simply because task A hasn't updated the counter
table yet. The result is a duplicate key value -- exactly what
we were trying to avoid in the first place. Attempting to fix the
problem by wrapping the key generation logic within a database transaction
only makes things worse:
declare @key int
begin transaction
    select @key = last_key + 1 from key_counter holdlock
    update key_counter set last_key = last_key + 1
commit transaction
First, the fact that the shared read lock here is held until the end
of the transaction does not prevent the other task from applying yet
another shared lock, so the problem of key duplication still exists.
Second, a shared lock cannot be promoted to an exclusive lock
until the competing task releases its shared lock. Because neither
task's shared lock will be released until its transaction completes, a deadlock
will result. The right solution would be to apply the exclusive lock
at the beginning of the transaction so that task A can proceed to
completion while task B is blocked:
declare @key int
begin transaction
    -- updating first acquires the exclusive lock, serializing the tasks
    update key_counter set last_key = last_key + 1
    select @key = last_key from key_counter
commit transaction
Although the above code will work and produce the expected results, its
main drawback is the necessity of updating the key counter table for
every generated key. Even though modern databases are capable of carrying
out the update operation in fractions of a millisecond, in a multitasking
environment such a design will create a hot spot -- numerous
tasks will block waiting for their turn to update the key counter table,
thereby greatly affecting the performance of the key generation process.
Better Sequence Number Allocation
As I stated previously, the main problem with using key counter
tables for sequence number generation is the cost associated with
the disk access incurred for every key generated. Thus, the obvious
way to make this algorithm more efficient is to devise an approach
that minimizes the required number of I/O operations and therefore reduces
the contention over the key counter table. A common technique is
to pre-allocate intervals of numbers to all processes attempting
to generate sequential keys. Consider the following SQL and Perl code:
create table key_counter (
    run_no int not null,
    seq_no int not null
)

insert into key_counter values (0, 0)
...
...
use DBI;
use Carp;

$dsn = "dbi:ODBC:DSN=MY_SQL_SERVER;database=mydatabase";
$conn = DBI->connect($dsn, "user", "password");
croak "Can't connect to $dsn...\n" unless $conn;

$seq_length = 4;    # number of key bits reserved for the sequence number
$key_ref = undef;   # cached counter row, shared by nextid() and load_counter()

my $key = nextid($conn);
print "generated $key...\n";
exit(0);

# Increment the run ID on disk and cache the new counter row in memory.
sub load_counter {
    my $conn = shift;
    $conn->begin_work();
    $conn->do("update key_counter set run_no = run_no + 1");
    # assign to the global $key_ref so that nextid() sees the new row
    $key_ref = $conn->selectrow_hashref("select * from key_counter");
    $conn->commit();
    $key_ref->{seq_no} = 0;
}

# Return the next key, reloading the counter only when the in-memory
# sequence numbers for the current run ID are exhausted.
sub nextid {
    my $conn = shift;
    load_counter($conn) unless defined($key_ref) and
        $key_ref->{seq_no} < (2**$seq_length - 1);
    return sprintf("0x%08X",
        ($key_ref->{run_no} << $seq_length) | $key_ref->{seq_no}++);
}
Here the key_counter table has two fields rather than one -- what
I call the run ID and the sequence number. The idea is rather simple
-- the run ID denotes the upper portion of the key (i.e., the
n most significant bits), while the sequence number represents
the least significant bits of the same key. The table is updated only
when it is necessary to increment the run ID; the sequence number
is incremented in memory without incurring any additional I/O.
The number of bits of the key space allocated to the run ID and
the sequence number, respectively, is configurable; it is controlled
by the $seq_length variable. Thus, if $seq_length is set to 4, the
sequence number will occupy the 4 least significant bits of the
resulting integer key, while the run ID will represent the remaining
28 bits of the key space. This allows for generating 15 sequential
keys with a single database access; setting $seq_length to 8
will yield 255 sequential keys per database access, and so on.
In theory, a given process that attempts to generate
sequential keys may have some prior knowledge of the required number
of keys. Thus, a data loading program can determine the number of
records for which it must generate the keys and then can adjust
the $seq_length control variable accordingly.
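For example, a loader that expects to insert a known number of rows
might size the control variable roughly as follows (a sketch; the
variable names are mine, not part of any library):

use POSIX qw(ceil);

# reserve enough sequence bits to cover all expected rows with a
# single update of the key counter table
my $n_records  = 100_000;                             # expected row count
my $seq_length = ceil(log($n_records + 1) / log(2));  # 17 bits here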
The bulk of the work here is done by the nextid() subroutine,
which first checks whether the run ID has already been initialized
($key_ref is set) and whether the process has exhausted the
number of keys allowed per single increment of the run ID (a sequence
number equal to or greater than the maximum number that can be represented
by $seq_length bits). If the sequence number is within the limit
allowed by $seq_length, nextid() builds a key by combining
the run ID, shifted left by $seq_length bits, with the current sequence
number, which is then incremented. If the allowed sequence numbers have
been exhausted, nextid() calls the load_counter() subroutine, which
increments the run ID and resets the sequence number to zero.
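To make the bit arithmetic concrete, here is a quick check of the key
layout for a hypothetical run ID of 1 with $seq_length set to 4:

my $seq_length = 4;
my $run_no     = 1;
for my $seq_no (0 .. 2) {
    # the run ID occupies the high bits, the sequence number the low bits
    printf "0x%08X\n", ($run_no << $seq_length) | $seq_no;
}
# prints 0x00000010, 0x00000011, 0x00000012

Once sequence numbers 0 through 14 are used up, the run ID is
incremented to 2 and the keys continue from 0x00000020.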
This little algorithm, despite its simplicity, is very effective
and allows for fast and virtually contention-free generation of
sequential integer keys. Perhaps its most attractive feature is
its ability to control the number and frequency of database I/O
while minimizing the gaps in generated integer keys.
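As a rough usage sketch, generating a batch of keys with the nextid()
subroutine shown earlier touches the key_counter table only once per
run ID increment:

# generate 100 keys; with $seq_length set to 4 (15 keys per run ID),
# this requires only ceil(100/15) = 7 updates of the counter table
my @keys;
push @keys, nextid($conn) for 1 .. 100;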
UUID-Based Keys
There are many situations where sequential integer keys
are the only efficient solution to the problem. For example, when
loading vast amounts of data into a table clustered on an integer
key, the speed of the loading process will greatly improve if the
generated keys are sequential rather than random. However, there
are a few problems with sequence numbers that make them less desirable
under certain circumstances.
First, sequence numbers are fairly expensive to generate, as it
is always necessary to keep track of the last-generated number via
some sort of persistent data structure, hence the I/O and unavoidable
contention. Second, the integer key space is rather limited, and
(even if you use 64-bit rather than 32-bit integers) it is always
possible to simply "run out of numbers" when dealing with
large amounts of data. Thus, whenever the sequential nature of the
key is not important, UUID-based key generation schemes are far
superior -- they are fast and cheap to generate and provide
a virtually unlimited supply of keys.
There are numerous facilities for generating UUIDs, provided by
databases, operating systems, and add-on software. Microsoft SQL
Server, for instance, provides the NEWID() function, which creates a
value of type "uniqueidentifier" -- a GUID. Microsoft
Windows, as part of the COM/DCOM subsystem, offers a function called
CoCreateGuid, which also generates GUID-style keys. However, for
the sake of illustrating cross-platform portability, rather than
using a database- or OS-specific GUID generation facility,
let's look at one of the publicly available Perl modules --
Data::UUID [3].
This module relies on a portable algorithm [2] and works on virtually
any platform. It supports very high UUID allocation rates --
up to 10 million per second per machine -- and guarantees the
uniqueness of the generated keys. Using the module is very easy:
use Data::UUID;

my $ug = Data::UUID->new();      # instantiate the UUID generator
my $uuid1 = $ug->create();       # generate a binary (16-byte) UUID
print $ug->to_string($uuid1);    # print its canonical string form
Once the UUID generator ($ug) is instantiated, its create() method
can be used to generate an unlimited number of binary UUIDs. The module
also supports numerous convenience methods, such as to_string(),
which produces the string representation of a UUID.
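If string keys are needed directly, the module's create_str() method
collapses the two steps into one:

use Data::UUID;

my $ug = Data::UUID->new();
# create_str() returns the canonical string form in a single call,
# for example "4162F712-1DD2-11B2-B17E-C09EFE1DC403"
my $uuid_str = $ug->create_str();
print "$uuid_str\n";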
Conclusion
Despite its seeming simplicity, key generation is a fundamental
issue known to be the source of some of the most annoying and frustrating
database problems. I hope that this article has shed some light
on simple yet effective key generation techniques. If it helps someone
to avoid one or two database nightmares, I will consider my mission
accomplished.
References
1. Date, C. J. 1999. An Introduction to Database Systems,
7th ed. Addison Wesley Longman. ISBN: 0201385902.
2. Leach, Paul J. and Rich Salz. Internet Draft "UUIDs and
GUIDs": http://www.globecom.net/ietf/draft/draft-leach-uuids-guids-01.html
3. Data::UUID -- Perl extension for generating Globally/Universally
Unique Identifiers (GUIDs/UUIDs): http://search.cpan.org/~AGOLOMSH/Data-UUID/
Alexander Daminoff is currently employed as a Senior Technology
Officer by JP Morgan Chase Inc, where he acts as a project manager
supporting numerous large-scale control applications for the Legal
and Compliance Department. He can be reached at: adaminoff@cronossystems.com. |