Schemaball:
A New Spin on Database Visualization
Martin Krzywinski
Understanding relationships and patterns in complex database schemas
is simplified when the data can be effectively visualized. Schema
diagrams are particularly useful when numerous entities and relationships
are involved. Conventions for drawing schemas, such as the entity-relationship
diagram (ERD) and the general Universal Modeling Language (UML)
framework, provide recipes to draw a wide range of entities and
relationships (see references). Their visual vocabulary is highly
controlled, to ensure consistency, and rich, to allow flexibility.
In an ERD, a table (entity) is represented by a rectangle, constraints
(relationships) by rhombuses that link the table symbols, while
cardinality of relationships is indicated by glyphs on the head
and tail of the constraint lines. A large array of open source and
commercial tools exist to generate ERDs (see references).
Unfortunately, the ERDs for large databases quickly become difficult
to follow. At the same time, it is precisely in these cases where
schema illustrations become indispensable in development and optimization.
For example, our MySQL sequencing LIMS system, which models laboratory
protocols and stores sequence data, contains 141 tables and 205
foreign keys. Information about a single laboratory process involves
multiple constraints and tables, making the process of tracing data
flow in an ERD very cumbersome.
To more quickly grasp the large-scale structure of schemas, it
is necessary to reduce the complexity of a schema diagram while
maintaining the pertinent details. Drawing hundreds of tables with
as many (or more) constraints in a traditional ERD often produces
a disheartening jumble.
In this article, I will describe a circular compositing method
to produce schema images of large databases in a manner that limits
the complexity of the drawing while retaining constraint and table
information. Figure 1 shows a classical ERD (A) and the circularly
composited schema (B,C) of a 10-table schema that models a company
of managers and workers. The circular schema is easy to generate
and can be drawn compactly.
The circular composition has a number of advantages. Routing of
constraint lines is trivial -- each constraint is drawn independently
of others. Constraints may be drawn with straight lines (Figure
1B) or curves, such as the Bezier type (Figure 1C). The advantage
of Bezier curves is both aesthetic and practical. The constraint
lines are radially tangent at the point of contact with a table
symbol. It is therefore easy to estimate the degree of connectedness
of any table. For example, Figure 1C quickly shows that the worker
is the most interrelated entity. This information is presented more
clearly than in Figure 1A because of the symmetry and tidiness of
the circular composition. The benefits of circular composition become
more compelling as the size of the database grows. Figure 2 shows
the schema of our LIMS database with all constraints from the Employee
table (Table 28) highlighted.
The remaining portion of this article presents Schemaball, an
application to draw circularly composited schemas. Options such
as hiding tables and highlighting features (Figure 3A, 3B), as well
as adjusting geometry and following constraint hierarchy (Figure
3C), are implemented. Schemaball has been designed to facilitate
automated generation of images and is therefore a command-line application.
All settings are stored in a configuration file, making it easy
to have multiple configurations for different databases and schema
images.
Implementation
Schemaball is implemented in Perl and uses DBI to connect to databases,
SQL::Translator to parse SQL dumps, and GD to produce PNG images.
Support for external configuration files is accomplished using Config::General.
The list of tables and constraints is obtained by polling the database,
parsing SQL schema dumps, or by using flat files that list the tables
and constraints. Schemaball is compatible with any SQL server for
which DBI drivers are available, because all SQL statements used
by Schemaball (e.g., SHOW TABLES, DESCRIBE tablename) are defined
in a configuration file. In this article, all SQL statements are
standardized to MySQL format.
A variety of schema images can be generated by toggling visibility
of elements in a schema image. Visibility and highlighting are controlled
by regular expressions applied to table and constraint names. By
selectively applying visibility and highlights, a schema image can
be created as a series of layers to be manipulated and composited
downstream (e.g., GIMP). Such composites provide suitable illustrations
for publications, in which the schema needs to be represented at
a high level.
Databases and SQL Commands
Schemaball does not require a live database server to generate
images. Data can be fed into the application using database schema
dumps, which contain CREATE TABLE directives, or an external file,
which lists the table names:
<db>
### connect directly
host = neurotica
database = sequence
user = martink
password = password
### parse dump with SQL::Translator
# sqlfile= sequence.sql
### table names from file
# tablefile = tablenames.txt
</db>
SQL-related parameters are stored in a <sqldb> block. The database
type is specified by the type variable (the corresponding DBI driver
module must be installed). All SQL statements are stored in named
<sql> sub-blocks that additionally specify the column number
of the query return field to be parsed:
<sqldb>
type = mysql
<sql listtables>
sql = "show tables"
colnum = 0
</sql>
. . .
<sql tablecreate>
sql = "show create table like _TABLE_"
colnum = 0
</sql>
</sqldb>
Constraint
Parsing
Schemaball provides three ways to determine foreign keys. This
flexibility is motivated by the popularity of MySQL and historical
lack of foreign key support. Foreign keys have been implemented
for InnoDB tables since version 3.23.44, but MyISAM tables still
lack this functionality, which is planned for future releases. Foreign
key directives in TABLE CREATE statements for MyISAM tables are
parsed by the server but provide no functionality.
Constraints can be extracted from MySQL table structures by using
the statements SHOW CREATE TABLE LIKE tablename or SHOW TABLE STATUS
FROM database LIKE tablename. For example, polling the worker table
(Figure 1) with SHOW CREATE TABLE produces constraint lines like
the following:
> SHOW CREATE TABLE LIKE worker;
. . .
CONSTRAINT '0_19' FOREIGN KEY ('worker_manager_id') \
REFERENCES 'manager' ('manager_id')
. . .
To parse the fact that the worker table and manager table are related,
a <linkrule> block is used. The sql variable provides the name
of the SQL statement previously defined in an <sql> block. The
rx variable specifies the recipe of how the query answer should be
parsed for constraints. Capturing brackets are used to return the
required string. In the case of the worker table, the regular expression
matches "manager":
<linkrule tablecreate>
sql = tablecreate
rx = REFERENCES '?(\w+)
use = yes
</linkrule>
For MySQL databases that do not use InnoDB tables, constraints are
typically implemented by application logic. In these cases, it is
up to the developer to insert and delete records in appropriate tables
to satisfy the constraints. For example, our sequencing LIMS database
uses MyISAM tables with a specific naming convention for foreign keys.
A foreign key field has the naming convention FKtype_table__field,
where type, table, and field indicate an optional type of the foreign
key and the table/field combination direct the reference. To support
this kind of constraint specification, the <linkrule> blocks
support a fieldrx variable. This variable stores the regular expression
to be applied to a field name. The capturing brackets return the table
and field name of the reference, as specified in the <fields>
sub-block:
<linkrule fieldregex>
fieldrx = .*FK_(\w+)__(\w+)
use = yes
<fields>
table = 0
field = 1
</fields>
</linkrule>
The final way to specify constraints is to use an external file that
stores the table pairs. This option is useful for cases when MyISAM
tables are used and the foreign key fields have no naming convention:
<linkrule linkfile>
file = fk.txt
use = yes
</linkrule>
Image Generation
Most of the parameters in the configuration file are dedicated
to Schemaball's main purpose -- generating the actual schema image.
The structure of the configuration file for image attributes is:
<image>
...
<elements>
...
<table> ... <table>
<label> ... </table>
<link> ... </link>
<bezier> ... </bezier>
<highlight> ... <highlight>
</elements>
<image>
Variables within these blocks specify the name and size of the image,
the colors of the image elements, drawing options for the glyphs and
link lines, as well as visibility and highlight parameters. The adjustable
parameters are shown in Figure 4. For Bezier constraint lines, the
Bezier radius is used to specify the location of the symmetric control
point for the curve.
Coordinates of each element in the image can be obtained by toggling
the variable image_element_dump. The location, size, and label of
each element (table glyph, text, or link line) is reported to STDOUT.
Coordinates in Cartesian (XY) and polar (Radius, Theta) coordinates
are given, as well as the ROTation of the element (applicable to
text labels):
BALLELEM TYPE text XY 607 140 RT 460 0.01630 ROT 1.57080 SIZE 40 LABEL car
BALLELEM TYPE circle XY 600 180 RT 420 0.00000 ROT 0.00000 SIZE 40 LABEL car
BALLELEM TYPE text XY 1040 467 RT 460 1.27946 ROT 0.31416 SIZE 40 LABEL coffeemaker
BALLELEM TYPE circle XY 999 470 RT 420 1.25664 ROT 0.00000 SIZE 40 LABEL coffeemaker
BALLELEM TYPE bezier_line_start XY 221 723 RT 398 4.39823 ROT 0.00000 SIZE 2 LABEL project___team
BALLELEM TYPE bezier_line_midpoint XY 450 599 RT 150 4.71239 ROT 0.00000 SIZE 2 LABEL project___team
BALLELEM TYPE bezier_line_end XY 225 720 RT 393 4.40062 ROT 0.00000 SIZE 2 LABEL project___team
The image coordinate dumps can be used to generate SVG images and
HTML image maps.
Tracing Constraints
When analyzing schemas it is desirable to be able to easily follow
the constraints across multiple tables, thereby identifying the
tables that participate in a complex relationship. For example,
the company schema shown in Figure 1 shows that company is connected
to manager, which connects to worker. Thus, worker is indirectly
related to company. In a very large schema, such as in Figure 2,
following such relationships manually is very tedious.
Schemaball implements automatic highlighting of links and tables
based on a recursive constraint trace. Starting with a single, highlighted
constraint link, or table, other tables or constraints links are
highlighted automatically by inheritance. For example, when highlight_by_table
is toggled, the highlight status of a table is inherited to all
its constraints. The direction of inheritance is controlled by highlight_by_table_forward
and highlight_by_table_reverse. Similarly, when highlight_by_link
is used, tables participated in a given constraint are highlighted.
The number of inheritance iterations is controlled by highlight_by_iterations.
The color of highlight in the image can be adjusted with each iteration.
For example, Figure 3C highlights all constraints starting from
the coffeemaker table for two iterations. The constraints removed
by one table from the worker table are shown in a lighter shade.
Feature Examples
Control by Regular Expression
The visibility and highlight of elements in an image are controlled
by regular expressions. This manner of control provides flexibility
in specifying the elements. Regular expressions are applied to table
names and to constraint names, which have the format table1__table2,
where table1 contains the foreign key pointing to table2.
Let's take the company schema from Figure 1 and focus on the how
the worker and coffeemaker fit in. Let's hide the manager, car,
company, and project tables, highlight the worker and coffeemaker
tables, and highlight any constraints emanating at the worker or
coffeemaker tables. Below are the relevant parts of the configuration
file:
<table>
<hide_rx>
man
car
com
pro
</hide_rx>
<highlight_rx>
wor
coffeema
</highlight_rx>
invisible_hide = no
</table>
<link>
<highlight_rx>
^worker__
^coffeemaker__
</highlight_rx>
</link>
The result is shown in Figure 5A. When invisible_hide is toggled,
the hidden tables are removed from the schema altogether and the visible
tables are redistributed along the circle (Figure 5B). Figure 5 also
shows the direction of the constraint -- the link line is made thicker
near the table with the foreign key. Currently Schemaball does not
display the cardinality of the constraint.
Development Templates and Anonymous Schemas
Optimizing a schema frequently involves constraint rearrangement.
This process can be facilitated by producing a schema template and
drawing the links by hand. Links can be turned off using:
<link>
show_links = no
</link>
This produces Figure 6A. If the exact details of the schema are proprietary,
but an illustration of the constraints is required, the tables may
be made anonymous, such as in Figure 6B. Alternatively, the table
labels may be removed.
<table>
labels = yes
anonymize = yes
</table>
Highlighting
Highlighting based on constraint tracing is useful to isolate
related tables and constraints in the image. Using the company schema
from Figure 1, let's start with the car table and traverse constraints
in the forward direction (i.e., from referent to reference). The
normal and highlight colors for tables and links are defined, and
the car table is highlighted by way of the regular expression in
a <highlight_rx> block:
<table>
fill = 102040 # dark blue
fill_hi = 4080ff #light blue
outline = 000000
outline_hi = cccccc
stroke = 3
<highlight_rx>
car
</highlight_rx>
</table>
<link>
color = 402010 # dark orange
color_hi = ff8f40 # light orange
stroke = 2
stroke_hi = 3
</link>
The resulting schema image is shown in Figure 7A, in which only the
car table is highlighted. To trace the constraints across the schema,
the <highlight> blocks is used. The highlight property is inherited
from links to tables (highlight_by_link) and from tables to links
(highlight_by_table):
<highlight>
highlight_by_link = yes
highlight_by_table = yes
highlight_by_table_forward = yes
highlight_by_table_reverse = no
highlight_by_iterations = 1
</highlight>
After one iteration of inheritance, Figure 7B is obtained. The highlighted
link is the constraint that links the car table to the manager table
(car references manager). By using values of 2 and 3 for highlight_by_iterations,
Figures 7C and 7D are produced.
Note that the colors of the highlight for subsequent iterations
are diluted to resemble the colors of normally colored elements.
This feature is controlled by:
<highlight>
fade_factor_table = 0.65
fade_factor_color = 0.65
</highlight>
The dilution is performed in HSB color space. The RGB color of an
element at the n(th) iteration is RGB(HSB(color)*(1-f)^n + HSB(color_hi)*f^n).
The addition is component-wise of the three-element HSB vector.
Conclusion
The circular composite schema diagram provides a scalable, high-level
view of the structure of a database. It provides an alternative
for cases when large and complex ERDs are inappropriate. Schemaball
is a scriptable application designed to generate circular schema
diagrams. Database table and constraint structure can be parsed
from various sources. Support for parsing SQL schema dumps obviates
the need for a live database.
Availability
Schemaball can be downloaded from mkweb.bcgsc.ca/schemaball
and requires the following Perl modules: Config::General, DBI, File::Basename,
and GD. To parse schemas from SQL dumps, the SQL::Translator module
also needs to be installed.
References
Schemaball -- mkweb.bcgsc.ca/schemaball, martink@bcgsc.ca
ERD notation -- http://www.smartdraw.com/resources/centers/software/erdcardinality.htm
Learning UML by Sinan Si Alhir (O'Reilly, 2003) http://www.oreilly.com/catalog/learnuml/index.html
Database Administration Tools
Mysqlfront -- http://www.mysqlfront.de
phpMyAdmin -- http://www.phpmyadmin.net
EMS MySQL Manager -- http://www.ems-hitech.com/mymanager
Mascon -- http://www.scibit.com/products/mascon
Schema Designers
Database Schema Designer v0.5 -- http://freshmeat.net/projects/datadesigner
DeZign for Databases v3 -- http://www.datanamic.com/dezign
Case Studio 2 -- http://www.casestudio.com
Schema Drawing Tools -- http://savage.net.au/Ron/html/drawing-tools.html
UML Designers (Visual Paradigm) -- http://www.visual-paradigm.com
SmartDraw -- http://www.smartdraw.com
Other UML Tools -- http://www.objectsbydesign.com/tools/umltools_byProduct.html
Bezier Curves -- http://astronomy.swin.edu.au/~pbourke/curves/bezier/
Martin Krzywinski is a bioinformatics research scientist at
the Genome Sciences Centre in Vancouver, British Columbia, Canada.
He applies Perl to problems in physical fingerprint mapping and
data processing automation. Find him adjusting the grind setting
for his espresso machine in his spare time. |