Maintaining Integrity

On December 23, 2009, in Research, Software Reviews, by Chris Prom

A few weeks ago, Alan Bell and I had an interesting conversation with Ian Angles, head of the Servers and Storage Unit at the University of Dundee’s Information and Communication Services (ICS). Ian is going to be helping me install and test repository applications, such as DSpace, Islandora, and RODA, over the early part of January. As part of our meeting, Ian, Alan, and I got into an interesting side conversation about fixity information.

As those of you who are familiar with the OAIS reference model know, one essential element of the Archival Information Package is ‘fixity information’: data tied to an object or set of objects (such as a zip file) that can be used to verify, over time, that the object has not been modified. Typically, fixity information is provided as an MD5 or other checksum. The PREMIS Data Dictionary (page 47) recommends that two or more instances of fixity information be stored for each object, and that systems periodically run events to generate fresh fixity information and compare it against the stored values.
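
To make this concrete, here is a rough sketch, in Python, of what generating two instances of fixity information per object might look like. The directory layout, manifest name, and choice of MD5 plus SHA-256 are illustrative assumptions on my part, not drawn from any particular repository application.

    # A minimal sketch: compute two checksums for every file under a directory
    # of archival objects and record them in a simple JSON manifest.
    import hashlib
    import json
    import os

    def fixity_for(path):
        """Return MD5 and SHA-256 checksums for a single file."""
        md5, sha256 = hashlib.md5(), hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), b''):
                md5.update(chunk)
                sha256.update(chunk)
        return {'md5': md5.hexdigest(), 'sha256': sha256.hexdigest()}

    def build_manifest(root, manifest_path='fixity-manifest.json'):
        """Walk the storage directory and record fixity information for each object."""
        manifest = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                manifest[path] = fixity_for(path)
        with open(manifest_path, 'w') as f:
            json.dump(manifest, f, indent=2)
        return manifest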

Obviously, ensuring that files have not been corrupted or changed is an essential element of any archives. Systems may be unable to open corrupted records, and even small changes can render files indecipherable. Even if the files can still be viewed, users may judge them inauthentic.

So, recording fixity information is important.  But there are two problems with relying solely on the OAIS notion of file-level fixity:

  • Unless your archives uses a complex piece of software as its repository, most applications will not easily allow you to verify checksums on a scheduled basis. RODA, for instance, includes a verify checksum event, but RODA is very difficult to install and configure and is not widely implemented, at least in the US. DSpace has a ‘checksum checker’ which can be configured manually, but it does not appear to be enabled by default. If you use a simple file system to store objects, running fixity checks is even more time consuming or requires custom scripting (a minimal sketch of such a script follows this list).
  • If errors are discovered, they would still require manual intervention to restore files–if an appropriate backup exists.  Obviously, coordinating the backup routines, file verification, and logging procedures is subject to considerable human error.  For example, the earliest available backup may have taken place after the file was corrupted, unless a lot of care is put into designing the backup and verification scheme.
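
The custom script mentioned above need not be elaborate. The sketch below assumes a JSON manifest of stored MD5 values, like the one generated earlier, and could be run from cron; the manifest name and reporting format are my own assumptions, not a recommendation for any particular system.

    # A rough sketch of a scheduled fixity check: recompute each file's MD5
    # and report anything that no longer matches the stored manifest.
    import hashlib
    import json

    def md5_of(path):
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), b''):
                h.update(chunk)
        return h.hexdigest()

    def verify(manifest_path='fixity-manifest.json'):
        """Return a list of (path, problem) pairs for files that fail the check."""
        with open(manifest_path) as f:
            manifest = json.load(f)
        failures = []
        for path, stored in manifest.items():
            try:
                if md5_of(path) != stored['md5']:
                    failures.append((path, 'checksum mismatch'))
            except IOError as e:
                failures.append((path, 'unreadable: %s' % e))
        return failures

    if __name__ == '__main__':
        for path, problem in verify():
            print('FIXITY FAILURE: %s (%s)' % (path, problem))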

It would be better to have some type of redundant and automatic error checking as a supplement to (or even a replacement for) these procedures.

During the meeting, Alan and I casually asked Ian whether any file systems include automatic checksum verification or error correction at the file level. None, unfortunately, do, but Ian mentioned that ZFS, an open source file system and volume manager developed by Sun Microsystems, would provide something of potential interest to us, and that he would use it on the storage pool he is establishing for our pilot project. Apparently he has used it successfully in other environments.

Since the meeting, I’ve spent a bit of time investigating ZFS. In addition to its other advantages, it includes a key feature of interest to archivists, or anyone interested in digital preservation: when the system is set up in a mirrored or RAID arrangement (e.g. two copies of all data are kept in separate locations within the pool of disks), as any archival system would need to be, block-level fixity information (checksums) is verified automatically while the system is live. If an error is discovered, it is ‘self-healed’ from a good copy. If the error involves media failure or corruption, the system administrator is warned that the disk needs to be replaced. For these reasons, an institution that uses ZFS (or a similar file system with a self-healing feature) as part of a repository system could have two levels of fixity checking (file level and block level).
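
For the curious, here is a hedged sketch of how an administrator might drive ZFS’s block-level checking from a script. The pool name ‘tank’ is an assumption; ‘zpool scrub’ and ‘zpool status -x’ are the standard commands for re-reading every block and reporting pool health.

    # A sketch, not a recipe: shell out to the standard 'zpool' commands.
    # A scrub runs in the background, re-reading every block and verifying
    # (and, in a redundant pool, repairing) its checksum; in practice you
    # would check pool health after the scrub completes.
    import subprocess

    POOL = 'tank'  # assumed pool name; substitute your own

    def start_scrub(pool=POOL):
        # Kick off a scrub; the command returns immediately while ZFS works.
        subprocess.call(['zpool', 'scrub', pool])

    def pool_health(pool=POOL):
        # 'zpool status -x' reports only pools with problems,
        # or 'all pools are healthy' when nothing is wrong.
        proc = subprocess.Popen(['zpool', 'status', '-x', pool],
                                stdout=subprocess.PIPE)
        out, _ = proc.communicate()
        return out

    if __name__ == '__main__':
        start_scrub()
        print(pool_health())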

Unfortunately, ZFS adoption has been slowed considerably by the fact that a company called NetApp filed suit against Sun for patent infringement, and the case has not yet been resolved. I have no way to judge whether NetApp’s claims are valid, or what the consequences would be if the lawsuit were to succeed, especially given that Sun has now been purchased by Oracle. But the effect of the suit has been to throw uncertainty over ZFS. For example, Apple has dropped a project to integrate it into its operating system for servers. Several recent rulings make it appear unlikely (according to Sun’s spin, at least) that the patent infringement case will succeed, but until the case is settled, it may be unwise to place too many bets on the system, such as integrating it into another open source project.

Nevertheless, archivists can and should be aware of the technology. Given a bit of basic knowledge about how it works, I certainly intend to discuss it with IT administrators and advocate for its use on any systems intended to provide archival storage. Free implementations are currently available for several platforms, including OpenSolaris and FreeBSD (where it is included by default), as well as Linux.

  • Seth

    Using ZFS for archival file storage is something I have thought about for a while. In addition to the fixity checking functionality, ZFS has built-in version control which would allow easy tracking of file migrations (http://www.miscmusings.com/2008/11/version-control-for-electronic-records.html). My biggest concern was support. I rely on library IT for our file storage administration, and they are not currently set up to support ZFS. Additionally, from what I understand, Duke’s OIT had some problems with ZFS resulting in data losses, and I was disappointed when Apple dropped ZFS support from Snow Leopard. Currently our fixity checking is based on cron, md5deep, some Perl, and coordination with back-up schedules, although we are starting to hit practical limits with the existing network file system infrastructure. As a side note, I have heard at least a few people talk about using ADAPT Ace (https://wiki.umiacs.umd.edu/adapt/index.php/Ace) for their fixity checking with no complaints.