Cover V13, i08
aug2004.tar

Schemaball: A New Spin on Database Visualization

Martin Krzywinski

Understanding relationships and patterns in complex database schemas is simplified when the data can be effectively visualized. Schema diagrams are particularly useful when numerous entities and relationships are involved. Conventions for drawing schemas, such as the entity-relationship diagram (ERD) and the general Universal Modeling Language (UML) framework, provide recipes to draw a wide range of entities and relationships (see references). Their visual vocabulary is highly controlled, to ensure consistency, and rich, to allow flexibility.

In an ERD, a table (entity) is represented by a rectangle, constraints (relationships) by rhombuses that link the table symbols, while cardinality of relationships is indicated by glyphs on the head and tail of the constraint lines. A large array of open source and commercial tools exist to generate ERDs (see references).

Unfortunately, the ERDs for large databases quickly become difficult to follow. At the same time, it is precisely in these cases where schema illustrations become indispensable in development and optimization. For example, our MySQL sequencing LIMS system, which models laboratory protocols and stores sequence data, contains 141 tables and 205 foreign keys. Information about a single laboratory process involves multiple constraints and tables, making the process of tracing data flow in an ERD very cumbersome.

To more quickly grasp the large-scale structure of schemas, it is necessary to reduce the complexity of a schema diagram while maintaining the pertinent details. Drawing hundreds of tables with as many (or more) constraints in a traditional ERD often produces a disheartening jumble.

In this article, I will describe a circular compositing method to produce schema images of large databases in a manner that limits the complexity of the drawing while retaining constraint and table information. Figure 1 shows a classical ERD (A) and the circularly composited schema (B,C) of a 10-table schema that models a company of managers and workers. The circular schema is easy to generate and can be drawn compactly.

The circular composition has a number of advantages. Routing of constraint lines is trivial -- each constraint is drawn independently of others. Constraints may be drawn with straight lines (Figure 1B) or curves, such as the Bezier type (Figure 1C). The advantage of Bezier curves is both aesthetic and practical. The constraint lines are radially tangent at the point of contact with a table symbol. It is therefore easy to estimate the degree of connectedness of any table. For example, Figure 1C quickly shows that the worker is the most interrelated entity. This information is presented more clearly than in Figure 1A because of the symmetry and tidiness of the circular composition. The benefits of circular composition become more compelling as the size of the database grows. Figure 2 shows the schema of our LIMS database with all constraints from the Employee table (Table 28) highlighted.

The remaining portion of this article presents Schemaball, an application to draw circularly composited schemas. Options such as hiding tables and highlighting features (Figure 3A, 3B), as well as adjusting geometry and following constraint hierarchy (Figure 3C), are implemented. Schemaball has been designed to facilitate automated generation of images and is therefore a command-line application. All settings are stored in a configuration file, making it easy to have multiple configurations for different databases and schema images.

Implementation

Schemaball is implemented in Perl and uses DBI to connect to databases, SQL::Translator to parse SQL dumps, and GD to produce PNG images. Support for external configuration files is accomplished using Config::General. The list of tables and constraints is obtained by polling the database, parsing SQL schema dumps, or by using flat files that list the tables and constraints. Schemaball is compatible with any SQL server for which DBI drivers are available, because all SQL statements used by Schemaball (e.g., SHOW TABLES, DESCRIBE tablename) are defined in a configuration file. In this article, all SQL statements are standardized to MySQL format.

A variety of schema images can be generated by toggling visibility of elements in a schema image. Visibility and highlighting are controlled by regular expressions applied to table and constraint names. By selectively applying visibility and highlights, a schema image can be created as a series of layers to be manipulated and composited downstream (e.g., GIMP). Such composites provide suitable illustrations for publications, in which the schema needs to be represented at a high level.

Databases and SQL Commands

Schemaball does not require a live database server to generate images. Data can be fed into the application using database schema dumps, which contain CREATE TABLE directives, or an external file, which lists the table names:

<db>
### connect directly
host = neurotica
database = sequence
user = martink
password = password
### parse dump with SQL::Translator
# sqlfile= sequence.sql
### table names from file
# tablefile = tablenames.txt
</db>
SQL-related parameters are stored in a <sqldb> block. The database type is specified by the type variable (the corresponding DBI driver module must be installed). All SQL statements are stored in named <sql> sub-blocks that additionally specify the column number of the query return field to be parsed:

<sqldb>
type = mysql
<sql listtables>
    sql = "show tables"
    colnum = 0
</sql>
. . .
<sql tablecreate>
    sql = "show create table like _TABLE_"
    colnum = 0
</sql>
</sqldb>
Constraint Parsing

Schemaball provides three ways to determine foreign keys. This flexibility is motivated by the popularity of MySQL and historical lack of foreign key support. Foreign keys have been implemented for InnoDB tables since version 3.23.44, but MyISAM tables still lack this functionality, which is planned for future releases. Foreign key directives in TABLE CREATE statements for MyISAM tables are parsed by the server but provide no functionality.

Constraints can be extracted from MySQL table structures by using the statements SHOW CREATE TABLE LIKE tablename or SHOW TABLE STATUS FROM database LIKE tablename. For example, polling the worker table (Figure 1) with SHOW CREATE TABLE produces constraint lines like the following:

> SHOW CREATE TABLE LIKE worker;
. . .
CONSTRAINT '0_19' FOREIGN KEY ('worker_manager_id') \
  REFERENCES 'manager' ('manager_id')
. . .
To parse the fact that the worker table and manager table are related, a <linkrule> block is used. The sql variable provides the name of the SQL statement previously defined in an <sql> block. The rx variable specifies the recipe of how the query answer should be parsed for constraints. Capturing brackets are used to return the required string. In the case of the worker table, the regular expression matches "manager":

<linkrule tablecreate>
sql  = tablecreate
rx   = REFERENCES '?(\w+)
use  = yes
</linkrule>
For MySQL databases that do not use InnoDB tables, constraints are typically implemented by application logic. In these cases, it is up to the developer to insert and delete records in appropriate tables to satisfy the constraints. For example, our sequencing LIMS database uses MyISAM tables with a specific naming convention for foreign keys. A foreign key field has the naming convention FKtype_table__field, where type, table, and field indicate an optional type of the foreign key and the table/field combination direct the reference. To support this kind of constraint specification, the <linkrule> blocks support a fieldrx variable. This variable stores the regular expression to be applied to a field name. The capturing brackets return the table and field name of the reference, as specified in the <fields> sub-block:

<linkrule fieldregex>
fieldrx = .*FK_(\w+)__(\w+)
use  = yes
<fields>
    table = 0
    field = 1
</fields>
</linkrule>
The final way to specify constraints is to use an external file that stores the table pairs. This option is useful for cases when MyISAM tables are used and the foreign key fields have no naming convention:

<linkrule linkfile>
    file = fk.txt
    use = yes
</linkrule>
Image Generation

Most of the parameters in the configuration file are dedicated to Schemaball's main purpose -- generating the actual schema image. The structure of the configuration file for image attributes is:

<image>
...    
<elements>
    ...
    <table> ... <table>
    <label> ... </table>
    <link> ... </link>
    <bezier> ... </bezier>
    <highlight> ... <highlight>
</elements>
<image>
Variables within these blocks specify the name and size of the image, the colors of the image elements, drawing options for the glyphs and link lines, as well as visibility and highlight parameters. The adjustable parameters are shown in Figure 4. For Bezier constraint lines, the Bezier radius is used to specify the location of the symmetric control point for the curve.

Coordinates of each element in the image can be obtained by toggling the variable image_element_dump. The location, size, and label of each element (table glyph, text, or link line) is reported to STDOUT. Coordinates in Cartesian (XY) and polar (Radius, Theta) coordinates are given, as well as the ROTation of the element (applicable to text labels):

BALLELEM TYPE text XY 607 140 RT 460 0.01630 ROT 1.57080 SIZE 40 LABEL car
BALLELEM TYPE circle XY 600 180 RT 420 0.00000 ROT 0.00000 SIZE 40 LABEL car
BALLELEM TYPE text XY 1040 467 RT 460 1.27946 ROT 0.31416 SIZE 40 LABEL coffeemaker
BALLELEM TYPE circle XY 999 470 RT 420 1.25664 ROT 0.00000 SIZE 40 LABEL coffeemaker 

BALLELEM TYPE bezier_line_start XY 221 723 RT 398 4.39823 ROT 0.00000 SIZE 2 LABEL project___team
BALLELEM TYPE bezier_line_midpoint XY 450 599 RT 150 4.71239 ROT 0.00000 SIZE 2 LABEL project___team
BALLELEM TYPE bezier_line_end XY 225 720 RT 393 4.40062 ROT 0.00000 SIZE 2 LABEL project___team
The image coordinate dumps can be used to generate SVG images and HTML image maps.

Tracing Constraints

When analyzing schemas it is desirable to be able to easily follow the constraints across multiple tables, thereby identifying the tables that participate in a complex relationship. For example, the company schema shown in Figure 1 shows that company is connected to manager, which connects to worker. Thus, worker is indirectly related to company. In a very large schema, such as in Figure 2, following such relationships manually is very tedious.

Schemaball implements automatic highlighting of links and tables based on a recursive constraint trace. Starting with a single, highlighted constraint link, or table, other tables or constraints links are highlighted automatically by inheritance. For example, when highlight_by_table is toggled, the highlight status of a table is inherited to all its constraints. The direction of inheritance is controlled by highlight_by_table_forward and highlight_by_table_reverse. Similarly, when highlight_by_link is used, tables participated in a given constraint are highlighted.

The number of inheritance iterations is controlled by highlight_by_iterations. The color of highlight in the image can be adjusted with each iteration. For example, Figure 3C highlights all constraints starting from the coffeemaker table for two iterations. The constraints removed by one table from the worker table are shown in a lighter shade.

Feature Examples

Control by Regular Expression

The visibility and highlight of elements in an image are controlled by regular expressions. This manner of control provides flexibility in specifying the elements. Regular expressions are applied to table names and to constraint names, which have the format table1__table2, where table1 contains the foreign key pointing to table2.

Let's take the company schema from Figure 1 and focus on the how the worker and coffeemaker fit in. Let's hide the manager, car, company, and project tables, highlight the worker and coffeemaker tables, and highlight any constraints emanating at the worker or coffeemaker tables. Below are the relevant parts of the configuration file:

<table>
<hide_rx>
    man
    car
    com
    pro
</hide_rx>
<highlight_rx>
    wor
    coffeema
</highlight_rx>
invisible_hide = no
</table>

<link>
<highlight_rx>
    ^worker__
    ^coffeemaker__
</highlight_rx>
</link>
The result is shown in Figure 5A. When invisible_hide is toggled, the hidden tables are removed from the schema altogether and the visible tables are redistributed along the circle (Figure 5B). Figure 5 also shows the direction of the constraint -- the link line is made thicker near the table with the foreign key. Currently Schemaball does not display the cardinality of the constraint.

Development Templates and Anonymous Schemas

Optimizing a schema frequently involves constraint rearrangement. This process can be facilitated by producing a schema template and drawing the links by hand. Links can be turned off using:

<link>
show_links = no
</link>
This produces Figure 6A. If the exact details of the schema are proprietary, but an illustration of the constraints is required, the tables may be made anonymous, such as in Figure 6B. Alternatively, the table labels may be removed.

<table>
labels = yes
anonymize = yes
</table>
Highlighting

Highlighting based on constraint tracing is useful to isolate related tables and constraints in the image. Using the company schema from Figure 1, let's start with the car table and traverse constraints in the forward direction (i.e., from referent to reference). The normal and highlight colors for tables and links are defined, and the car table is highlighted by way of the regular expression in a <highlight_rx> block:

<table>
fill = 102040  # dark blue
fill_hi =  4080ff  #light blue
outline = 000000
outline_hi = cccccc
stroke = 3
<highlight_rx>
    car
</highlight_rx>
</table>

<link>
color = 402010 # dark orange
color_hi = ff8f40 # light orange
stroke = 2
stroke_hi = 3
</link>
The resulting schema image is shown in Figure 7A, in which only the car table is highlighted. To trace the constraints across the schema, the <highlight> blocks is used. The highlight property is inherited from links to tables (highlight_by_link) and from tables to links (highlight_by_table):

<highlight>
highlight_by_link = yes
highlight_by_table = yes
highlight_by_table_forward = yes
highlight_by_table_reverse = no
highlight_by_iterations = 1
</highlight>
After one iteration of inheritance, Figure 7B is obtained. The highlighted link is the constraint that links the car table to the manager table (car references manager). By using values of 2 and 3 for highlight_by_iterations, Figures 7C and 7D are produced.

Note that the colors of the highlight for subsequent iterations are diluted to resemble the colors of normally colored elements. This feature is controlled by:

<highlight>
fade_factor_table = 0.65
fade_factor_color = 0.65
</highlight>
The dilution is performed in HSB color space. The RGB color of an element at the n(th) iteration is RGB(HSB(color)*(1-f)^n + HSB(color_hi)*f^n). The addition is component-wise of the three-element HSB vector.

Conclusion

The circular composite schema diagram provides a scalable, high-level view of the structure of a database. It provides an alternative for cases when large and complex ERDs are inappropriate. Schemaball is a scriptable application designed to generate circular schema diagrams. Database table and constraint structure can be parsed from various sources. Support for parsing SQL schema dumps obviates the need for a live database.

Availability

Schemaball can be downloaded from mkweb.bcgsc.ca/schemaball and requires the following Perl modules: Config::General, DBI, File::Basename, and GD. To parse schemas from SQL dumps, the SQL::Translator module also needs to be installed.

References

Schemaball -- mkweb.bcgsc.ca/schemaball, martink@bcgsc.ca

ERD notation -- http://www.smartdraw.com/resources/centers/software/erdcardinality.htm

Learning UML by Sinan Si Alhir (O'Reilly, 2003) http://www.oreilly.com/catalog/learnuml/index.html

Database Administration Tools

Mysqlfront -- http://www.mysqlfront.de

phpMyAdmin -- http://www.phpmyadmin.net

EMS MySQL Manager -- http://www.ems-hitech.com/mymanager

Mascon -- http://www.scibit.com/products/mascon

Schema Designers

Database Schema Designer v0.5 -- http://freshmeat.net/projects/datadesigner

DeZign for Databases v3 -- http://www.datanamic.com/dezign

Case Studio 2 -- http://www.casestudio.com

Schema Drawing Tools -- http://savage.net.au/Ron/html/drawing-tools.html

UML Designers (Visual Paradigm) -- http://www.visual-paradigm.com

SmartDraw -- http://www.smartdraw.com

Other UML Tools -- http://www.objectsbydesign.com/tools/umltools_byProduct.html

Bezier Curves -- http://astronomy.swin.edu.au/~pbourke/curves/bezier/

Martin Krzywinski is a bioinformatics research scientist at the Genome Sciences Centre in Vancouver, British Columbia, Canada. He applies Perl to problems in physical fingerprint mapping and data processing automation. Find him adjusting the grind setting for his espresso machine in his spare time.