Lightweight Persistent Data

Randal L. Schwartz

Frequently, you have data with a strong will to live. That is, your data must persist between invocations of your program and occasionally even be shared between simultaneous invocations.

At the high end of this demand, we have entire companies devoted to creating high-performance, multi-user, SQL-interfaced databases. These databases are usually accessed from Perl via the DBI package, or by some wrapper slightly above DBI, such as Class::DBI or DBIx::SQLEngine. The details of SQL might even be entirely hidden away using a higher level package like Tangram or Alzabo.

But further down the scale, there are some new solutions popping onto the scene that invite a closer look, as well as some old classics. For example, since Perl version 2 we've been able to put a hash out on disk with dbmopen:

dbmopen(%HASH, "/path/on/disk", 0644) || die;
$HASH{"key"} = "value";
dbmclose(%HASH);
The effect of such code is that we now have a key/value pair stored in an external structured file. We can later come along, reopen the database, and treat it as a hash with preexisting values:

dbmopen(%HASH, "/path/on/disk", 0644) || die;
foreach $key (sort keys %HASH) {
  print "$key => $HASH{$key}\n";
}
dbmclose(%HASH);
The interface was relatively simple, and I wrote quite a few programs using this storage mechanism for persistence before Perl 5 came around. However, this storage suffered from some limitations: the keys and values had to be under a given size, the structure could not safely handle simultaneous reads and writes from multiple users, and the resulting data files were not necessarily portable to other machines (because they used incompatible libraries or byte orders).

When Perl 5 came along, new problems arose. No longer were we limited to flat arrays and hashes: with references, we could now build complex data types with arbitrary structure. Luckily, the mechanism "behind" dbmopen was made available directly at the Perl code level through the tie operator, described in the perltie manpage. This let others besides Larry Wall create "magical" hashes that could perform actions on every fetch and store.
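
To see what tie gives us, here's a minimal sketch of a tied hash that announces every fetch and store. The LoggingHash class name is mine, invented for illustration, and a complete implementation would also supply DELETE, EXISTS, FIRSTKEY, and friends:

package LoggingHash;
sub TIEHASH { my $class = shift; bless {}, $class }  # object backing the hash
sub STORE {
  my ($self, $key, $value) = @_;
  print "storing $key\n";
  $self->{$key} = $value;
}
sub FETCH {
  my ($self, $key) = @_;
  print "fetching $key\n";
  return $self->{$key};
}

package main;
tie my %hash, 'LoggingHash';
$hash{fred} = 205;         # prints "storing fred"
print $hash{fred}, "\n";   # prints "fetching fred", then 205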

One early use of the tie mechanism was the MLDBM package, which could take a complex value to be assigned for a given key, and serialize it to a single string value, which could then be stored much like before. For example:

use MLDBM;   # uses SDBM_File and Data::Dumper by default
use Fcntl;   # for the O_CREAT and O_RDWR flags
tie my %hash, 'MLDBM', '/path/on/disk', O_CREAT|O_RDWR, 0644 or die $!;
$hash{my_array} = [1..5];
$hash{my_scores} = { fred => 205, barney => 195, dino => 30 };
As each complex data structure was stored into the hash, it was converted into a string using Data::Dumper, FreezeThaw, or Storable. When a value was fetched, it was converted back from a string into the complex data structure. However, the resulting value was no longer connected to the tied hash. For example:

my $scores = $hash{my_scores};
$scores->{fred} = 215;
would no longer affect the stored data. Instead, the MLDBM manpage warned us to "not do this". Also, we still had all the limitations of a standard dbmopen-style database: size limits, unsafe multi-user access, and non-portability.
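
The documented way around this limitation is to fetch the value into a temporary variable, modify the copy, and then store the entire value back, so the serializer sees the change:

my $scores = $hash{my_scores};  # deserializes a fresh copy
$scores->{fred} = 215;          # modify the copy only
$hash{my_scores} = $scores;     # reserialize and store the whole value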

One solution that I used on more than one occasion was to take over the serialization myself, and to use Storable's retrieve and nstore operations directly. My code would look something like:

use Storable qw(nstore retrieve);
my $data = retrieve('file');   # dies if 'file' doesn't exist yet
... perform operations with $data ...
nstore $data, 'file';          # nstore writes in network byte order, for portability
Now my $data value could be an arbitrarily complex data structure, and any changes I made would be completely reflected in the updated file. The result was that I simply had a Perl data structure that persisted.
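
One wrinkle with that pattern: retrieve dies when the file doesn't exist yet, so the first run needs a fallback. A minimal sketch, with the file name chosen purely for illustration:

use Storable qw(nstore retrieve);

my $file = '/tmp/mydata';                     # hypothetical path
my $data = -e $file ? retrieve($file) : {};   # empty hashref on first run

$data->{run_count}++;                         # any structural change works
nstore($data, $file);                         # write the whole structure back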

It appears that the author of Tie::Persistent had the same idea of using Storable on the entire top-level structure, except with a tie wrapper instead of explicit fetch and store phases, although I can't vouch for the code. In fact, I see a number of CPAN entries that all seem to use similar mechanisms, but none of them seems to have found the "holy grail" of object persistence: making it as transparent as possible in a nice, portable (and hopefully multi-user) manner.

That is, until I noticed DBM::Deep. According to the Changelog, this distribution has been around for about two years (as I write this), but has been on the CPAN for only a few months. From its own description:

A unique flat-file database module, written in pure perl. True multi-level hash/array support (unlike MLDBM, which is faked), hybrid OO / tie() interface, cross-platform FTPable files, and quite fast. Can handle millions of keys and unlimited hash levels without significant slow-down. Written from the ground-up in pure perl -- this is NOT a wrapper around a C-based DBM. Out-of-the-box compatibility with Unix, Mac OS X and Windows.

And with a promotional paragraph like that, I just had to take a look. It seems simple enough. I merely say:

use DBM::Deep;
my $hash = DBM::Deep->new("foo.db");
$hash->{my_array} = [1..5];
$hash->{my_scores} = { fred => 205, barney => 195, dino => 30 };
And that's it. In my next program:

use DBM::Deep;
my $hash = DBM::Deep->new("foo.db");
$hash->{my_scores}->{fred} = 215; # update score
And finally, retrieving it all:

use DBM::Deep;
my $hash = DBM::Deep->new("foo.db");
print join(", ",@{$hash->{my_array}}), "\n";
for (sort keys %{$hash->{my_scores}}) {
  print "$_ => $hash->{my_scores}->{$_}\n";
}
which prints:

1, 2, 3, 4, 5
barney => 195
dino => 30
fred => 215
And, in fact, that all just plain worked. I'm impressed. We've avoided the MLDBM problem, because the update to the nested data worked. And there's no dependency on traditional DBM libraries here, so there are no size limitations or byte-ordering issues, or even the need for a C compiler to install it.

I'm told, although I haven't tested it, that I can also add:

$hash->lock;
... do some shared things ...
$hash->unlock;
and thereby safely share the data among multiple processes.
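
If it works as documented, a classic read-modify-write, such as bumping a score, would be bracketed by that pair of calls. This is a sketch based on the module's documented lock and unlock methods, not code I've run:

use DBM::Deep;
my $hash = DBM::Deep->new("foo.db");

$hash->lock;                       # exclusive lock (flock underneath)
$hash->{my_scores}->{fred} += 10;  # read-modify-write is now race-free
$hash->unlock;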

There also seems to be some cool stuff for encrypting or compressing the data as well. This definitely bears further examination.
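
From a quick skim of the documentation, this is done with filter hooks that transform keys or values on their way in and out. A minimal sketch, assuming the documented set_filter method and the Compress::Zlib module; again, I haven't run this:

use DBM::Deep;
use Compress::Zlib;   # provides compress() and uncompress()

my $db = DBM::Deep->new("compressed.db");

# squeeze values going in, expand them coming out
$db->set_filter( filter_store_value => sub { compress($_[0]) } );
$db->set_filter( filter_fetch_value => sub { uncompress($_[0]) } );

$db->{log} = "a long, repetitive string" x 1000;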

The limitations of DBM::Deep seem rather expected. Because this is a single data file locked with flock, we can't share the data among multiple machines or reliably across NFS. Also, we have to clean up after ourselves from time to time by calling an optimize method; otherwise, unused space starts accumulating in the database.
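
Per the documentation, that cleanup is a single method call that rewrites the file, reclaiming the dead space left behind by deleted or overwritten entries:

use DBM::Deep;
my $db = DBM::Deep->new("foo.db");
$db->optimize;   # compact the file, reclaiming unused space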

One other recent addition to the CPAN also caught my eye -- OOPS. Unlike DBM::Deep, OOPS uses a DBI-style database (currently only compatible with PostgreSQL, MySQL, and SQLite) for its persistent store. However, like DBM::Deep, once a connection is made, you can do pretty much anything you want with the data structure, and it gets reflected into the permanent storage. The database tables are created on request and managed transparently by the module.

The basic mode of OOPS looks like:

use OOPS;
transaction(sub {
  OOPS->initial_setup(
    dbi_dsn => 'dbi:SQLite:/tmp/oops',
    username => undef, # doesn't matter with SQLite
    password => undef, # ditto
  ) unless -s "/tmp/oops";

  my $hash = OOPS->new(
    dbi_dsn => 'dbi:SQLite:/tmp/oops',
    username => undef, # doesn't matter with SQLite
    password => undef, # ditto
  );

  $hash->{my_array} = [1..5];
  $hash->{my_scores} = { fred => 205, barney => 195, dino => 30 };
  $hash->{my_scores}->{fred} = 215; # update score

  $hash->commit;
});
The transaction wrapper forces these updates to happen within a single database transaction. We fetch the data similarly:

use OOPS;
transaction(sub {
  my $hash = OOPS->new(
    dbi_dsn => 'dbi:SQLite:/tmp/oops',
    username => undef, # doesn't matter with SQLite
    password => undef, # ditto
  );

  print join(", ",@{$hash->{my_array}}), "\n";
  for (sort keys %{$hash->{my_scores}}) {
    print "$_ => $hash->{my_scores}->{$_}\n";
  }
});
And, in fact, this retrieved exactly the values I had expected. I'll be exploring these two modules in greater depth in the future, and until then, enjoy!

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.