Database Preservation: Solved?

On November 19, 2009, in Research, Software Reviews, by Chris Prom

At the Planets workshop I am attending in Bern, Amir Bernstein from the Swiss Federal Archives demonstrated the SIARD Suite.  SIARD is a set of Java applications that facilitate the preservation of information stored within relational databases.  It can be run as a client application (on multiple platforms) or can be called and integrated with other services.  It will even run from a USB drive!

Based on what I saw in the demo, the suite is extremely simple to use yet very powerful.  Essentially, it creates an XML instance of a DB, conforming to an XML schema that the Swiss Federal Archives developed.  It currently converts data from MS Access, Oracle, and SQL Server.  For each database being preserved, it creates content and metadata folders.  The metadata folder contains descriptive information about the database, tables, and columns in the DB.  (The metadata is pulled from DB description fields for MSSQL and Oracle, but not Access, since the method to access the descriptions is not currently documented by Microsoft.)  A viewer is provided for the content and there is an editor for the metadata, so archivists can add sufficient contextual information to make the preserved data useful.
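To make the idea of an "XML instance of a DB" concrete, here is a minimal sketch in Python that dumps one table into a metadata document and a content document. It uses SQLite only because it ships with Python (SIARD itself reads Access, Oracle, and SQL Server), and the element names (`table`, `column`, `row`, `c`), the `export_table` function, and the sample table are hypothetical. They are not taken from the SIARD schema.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_table(conn, table):
    """Return (metadata_xml, content_xml) strings for one table.

    Hypothetical layout for illustration; not the SIARD XML schema.
    """
    # Column names and declared types come from SQLite's own catalog:
    # each row is (cid, name, type, notnull, dflt_value, pk).
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()

    # Metadata document: one <column> element per column.
    meta = ET.Element("table", name=table)
    for _cid, name, decl_type, *_rest in columns:
        ET.SubElement(meta, "column", name=name, type=decl_type)

    # Content document: one <row> per record, one <c> per cell.
    content = ET.Element("table", name=table)
    for row in conn.execute(f'SELECT * FROM "{table}"'):
        row_el = ET.SubElement(content, "row")
        for (_cid, name, *_rest), value in zip(columns, row):
            cell = ET.SubElement(row_el, "c", name=name)
            cell.text = "" if value is None else str(value)

    return (ET.tostring(meta, encoding="unicode"),
            ET.tostring(content, encoding="unicode"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accessions (id INTEGER, title TEXT)")
conn.execute("INSERT INTO accessions VALUES (1, 'Faculty papers')")
metadata_xml, content_xml = export_table(conn, "accessions")
print(metadata_xml)
print(content_xml)
```

Keeping the descriptive metadata in its own document is what would let an archivist edit descriptions without touching the preserved rows.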

It stores the content of the database in one or more schema folders, each of which contains folders for tables in the DB.  It also preserves the roles, users, etc. that are defined in the source database.  The Swiss placed a real emphasis on data normalization and efficiency.  The data is stored in Unicode and conforms to SQL 1999.  Certain proprietary extensions, such as unique datatypes in Access, are converted to a simpler datatype conforming to the SQL 1999 standard, but the information is accurately converted.  The content and metadata files are wrapped into a ZIP64 file (which was necessary to allow file sizes over 4 GB).*  SIARD also includes an exporter, so that data can be placed back into a relational DB for use.  As a result, SIARD increases DB interoperability by making it possible, for instance, to migrate a DB from one format to another with fairly minimal fuss.  Anyone who has tried to move a database from one platform to another knows how difficult such a process can be using the tools provided by the vendor!
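The packaging step can be sketched with Python's standard `zipfile` module. ZIP containers that outgrow the classic 4 GB limits rely on the ZIP64 extensions, which `zipfile` enables through `allowZip64=True`. The folder paths and the `package` function below are hypothetical, and storing entries uncompressed (`ZIP_STORED`) is my own choice for the sketch, not a statement about what SIARD does.

```python
import io
import zipfile


def package(files):
    """Pack {archive_path: bytes} into one ZIP container and return its bytes."""
    buffer = io.BytesIO()
    # allowZip64=True lets zipfile switch to the ZIP64 extensions when an
    # entry or the whole archive outgrows the classic 4 GB limits.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED,
                         allowZip64=True) as archive:
        for path, data in files.items():
            archive.writestr(path, data)
    return buffer.getvalue()


# Hypothetical folder layout: one metadata folder, one content folder
# holding a schema folder with a folder per table.
package_bytes = package({
    "metadata/metadata.xml": b"<table name='accessions'/>",
    "content/schema0/table0/table0.xml": b"<table/>",
})
with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
    names = archive.namelist()
print(names)
```

A tiny archive like this one never needs the ZIP64 records; the flag only matters once the data is large enough to exceed the older format's limits.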

SIARD is also a very scalable application.  The largest database that the Swiss Federal Archives converted contained over 50GB of data in 2,000+ tables!  SIARD also migrates binary (BLOB) data into the content folder.  Any binary data stored in the database will be accessible in the future, as long as a viewer exists that can open its file format.  (One can imagine migrating items, if certain file types are going obsolete.)

One important point to note is that SIARD preserves only the database, not the look and feel of the DB.  Hartwig Thomas noted after the session that the Swiss Federal Archives' policy ruled preserving applications out of scope for this project, and that they made this decision for a very specific reason: they knew that the fundamental issue that needed to be solved was preserving the data.

I think that was a smart move.  First, preserving the data is a (relatively) easier problem to solve than preserving the entire application that serves up the data in a particular format.  Logically, it makes sense to attack data preservation first.  For many institutions, no more may be required, unless you define the preservation of ‘look and feel’ in your preservation plan for the records you are converting.  Realistically, the resources needed to implement such a policy on more than a one-off basis would be immense.  There are simply so many unpredictable factors and dependencies (operating systems, software, external libraries, the possibility of future versions of software not being backwards compatible, etc.) that it would be very expensive and complex to undertake such an operation.

Given that SIARD uses the standards mentioned above, we can assume that future users will be able to import the data into another database system, or mine it with tools yet to be developed.  XML, Unicode, and SQL 1999 will almost certainly be readable by future applications.  But, it is important to point out, you can also access the DB using the SIARD viewer application, which is part of the SIARD suite.  Finally, institutions could build a web application that would allow basic access to ALL databases under their control, not just one.  Although functionality would obviously be limited, it would provide a minimal, documentable level of data access.
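As a rough illustration of why plain XML makes later re-import tractable, the sketch below reads a small XML table and loads it into a fresh database using nothing but Python's standard library. The XML layout, the `load_table` function, and the sample rows are hypothetical and do not follow the SIARD schema, and loading every column as TEXT is a simplification; a real import would use the archived type metadata.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical content document; not the SIARD XML schema.
CONTENT_XML = """
<table name="accessions">
  <row><c name="id">1</c><c name="title">Faculty papers</c></row>
  <row><c name="id">2</c><c name="title">Board minutes</c></row>
</table>
"""


def load_table(conn, xml_text):
    """Create a table from the XML and insert its rows; return the table name.

    Assumes at least one row, and the same columns in every row.
    """
    root = ET.fromstring(xml_text)
    table = root.get("name")
    rows = [[(c.get("name"), c.text) for c in row] for row in root.iter("row")]
    columns = [name for name, _value in rows[0]]

    # Simplification: every column is created as TEXT.
    column_sql = ", ".join(f'"{name}" TEXT' for name in columns)
    conn.execute(f'CREATE TABLE "{table}" ({column_sql})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [[value for _name, value in row] for row in rows],
    )
    return table


conn = sqlite3.connect(":memory:")
table_name = load_table(conn, CONTENT_XML)
titles = [r[0] for r in
          conn.execute(f'SELECT title FROM "{table_name}" ORDER BY id')]
print(titles)
```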

I have requested a copy of SIARD (one has to register and be provided the URL before downloading).  I’m eagerly awaiting it so I can try the software out.


*To unpack the ZIP file, you would need to use PKZip or another paid zip application that understands the ZIP64 format.  However, the lead developer of SIARD, Hartwig Thomas, is also developing a SourceForge project to provide a free/open source zip/unzip application that understands ZIP64.
