MAC Newsletter Article
MAC Newsletter, Electronic Currents, January 2009
Practical Tools for Managing Electronic Records
In the last issue, Mark Myers described why and how he became an electronic records archivist. I hope Mark’s story struck a chord with many of you. It certainly did with me, mostly because he demystified the process of learning about electronic records management and preservation. In Mark’s words, the main skills that an archivist needs to deal with electronic records are simply “acceptance and a willingness to learn.” It was nice to hear someone as highly respected as Mark say that. The many standards, projects and technologies for e-records management can seem overwhelming to anyone attacking (or being attacked by!) the field for the first time. For that reason, I’d like to use this column to highlight some specific tools that may help us all get started with the business of effectively managing electronic records. While I can’t claim that using these will solve all of your e-records problems, they will help you think about issues in a practical way, so that you can begin developing a complete electronic records strategy for your institution.
As a bit of an aside, I should mention that I feel very fortunate at the moment, since I have ten months to think about these issues. During this time, I am taking a sabbatical from the University of Illinois and am in residence at the University of Dundee, Scotland, where I am conducting a Fulbright-sponsored research project titled “Practical Approaches to Identifying, Preserving, and Providing Access to Electronic Records.” That probably sounds more impressive than it really is: I am simply attempting to make some sense for myself out of the plethora of projects, standards, and software tools that can be used to facilitate trustworthy archival work with electronic records. Hopefully, the lessons I learn will also help others who, like me, have a strong interest in electronic records but do not have the luxury of taking time off from work to pursue it.
As a newbie to the digital preservation world, I have learned only one thing over the first three months of my project: given the current state of knowledge, any archivist can take immediate steps to implement a program that complies, at least in general terms, with the requirements necessary to establish a trustworthy program for managing electronic records.1 Within a year, the task will be even easier since a host of new tools and services are coming close to maturity.
To help others implement some of these tools, I am evaluating software and developing a ‘cookbook’ of policy and procedure templates as well as links to software and services. It is my hope that archivists will be able to use these resources to establish archival programs for identifying, capturing, preserving, and providing access to electronic records that fall within their repository’s documentary mandate. The need is particularly acute at ‘smaller’ institutions, by which I mean those without access to the computing resources typically available at large national, state, or university archives. The approach I am taking recognizes that each institution will need to adopt different approaches that reflect different funding levels, computing environments, and staff skill sets. Nevertheless, there are many common elements that all programs share, so quite a few of the tools will be applicable at many institutions.
It is far beyond the scope of this column, or indeed my limited abilities, to suggest how a general system to manage electronic records might be designed, even if such a task were possible given the innate differences of circumstance between institutions, archivists, and programmatic needs.2 Nevertheless, I would like to describe some software or services that I have found useful at this early stage in my project. As a rubric to organize the discussion, I’ll divide the tools into those that can help you implement the OAIS functional areas of ‘Ingest’ and ‘Storage’. I will end the column on a hopeful note, by mentioning some emerging projects that will very soon provide us all with some string to tie many tools and services into a complete e-records processing package.
The OAIS function known as “Ingest” covers everything under the analog-world headings of appraisal, assessment, arrangement, description, and accessioning. If you are thinking that this covers a great deal of ground, you are correct. Ingest activities include everything necessary to successfully take custody of records from a creator, prepare them for storage, and submit them to a repository so that users can find and use them. E-records require the same arrangement and description steps as analog records, but extra work needs to be done to preserve them and make them usable. That is because digital records can only be read and understood after machines and technologies have accurately rendered them, unlike analog records, which require merely an intelligent (or at least sentient) human agent.
Unfortunately, the ingest functions we will examine begin with a software gap. There is currently no stand-alone application that facilitates the automated, orderly submission of electronic records from records creators to repositories. But there is hope. Tufts University’s TAPER project is developing an XML standard, as well as associated tools, to create and manage submission agreements and the e-records they describe.3 Submission agreements are important because they document the exact terms of a particular deposit, in particular its provenance, structure, and access requirements. Submission agreements also serve as the basis for developing a preservation plan. Although Tufts has not yet released any software to manage them, they do provide some templates, which would serve as an excellent basis for local policy development. Until the tools are released, using submission agreements in printed or, ideally, structured electronic form would help a repository collect critical information that will facilitate subsequent collection management.
Once a repository has received the files described in a submission agreement, it can use many tools to extract metadata, describe the files in a descriptive system, and prepare them for deposit into a repository. The New Zealand Metadata Extractor is one of the best.4 It can be used on a stand-alone basis and is also being incorporated into other tools. Taking as input many common file types (it supports many image and document formats), it will generate an XML file containing the essential technical and descriptive metadata that is stored internally in the files. That metadata can then be stored locally or loaded into a preservation repository.
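To make the idea of an XML metadata file more concrete, here is a minimal Python sketch. It is not the New Zealand Metadata Extractor and does not read the metadata embedded inside files; it only records a few facts available from the file system (name, size, a guessed MIME type, modification date, and a checksum). The element names and the function names `describe_file` and `describe_folder` are my own assumptions for illustration, not part of any standard.

```python
import hashlib
import mimetypes
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def describe_file(path):
    """Return an XML element with basic technical metadata for one file."""
    stat = os.stat(path)
    with open(path, "rb") as handle:
        checksum = hashlib.sha256(handle.read()).hexdigest()
    # The MIME type is guessed from the file extension only.
    mime_type, _encoding = mimetypes.guess_type(path)

    record = ET.Element("file")  # hypothetical element names throughout
    ET.SubElement(record, "name").text = os.path.basename(path)
    ET.SubElement(record, "size_bytes").text = str(stat.st_size)
    ET.SubElement(record, "mime_type").text = mime_type or "unknown"
    ET.SubElement(record, "modified").text = datetime.fromtimestamp(
        stat.st_mtime, tz=timezone.utc
    ).isoformat()
    ET.SubElement(record, "sha256").text = checksum
    return record


def describe_folder(folder):
    """Walk a folder and return one XML document describing every file."""
    root = ET.Element("accession")
    for dirpath, _dirnames, filenames in os.walk(folder):
        for filename in sorted(filenames):
            root.append(describe_file(os.path.join(dirpath, filename)))
    return ET.tostring(root, encoding="unicode")
```

Calling `describe_folder` on a folder of received files returns a single XML string that could be saved alongside the accession. A real extractor goes much further, since it opens each file and reads the format-specific metadata stored inside it.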
A repository may find it necessary to transform objects to a different format for preservation. Although a complete set of tools for completing transformations is beyond the scope of this article, I would like to mention some covering one critical format: email. The excellent application Aid4Mail will transform an entire email folder or set of folders from one of the many proprietary formats into another proprietary format or, more appropriately for archivists, into the generic mbox format.5 While the mbox format itself may not be ideal for long-term preservation, converting email to it offers medium-term format stability and the ability to effect further transformations in the future. For example, the Smithsonian Institution Archives and North Carolina State Archives both provide email parsers that convert mbox files to the XML standard recently developed for email.6 They are also hoping to create a viewer application for the new XML email format.
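One reason mbox is a useful intermediate format is that it is plain text and widely supported. As a rough illustration, the following Python sketch reads an mbox file with the standard library’s `mailbox` module and writes a simple XML summary of each message’s headers. It is not the Smithsonian or North Carolina parser, and the element names are hypothetical; they do not follow the XML email standard mentioned above.

```python
import mailbox
import xml.etree.ElementTree as ET


def mbox_to_xml(mbox_path):
    """Summarize the main headers of each message in an mbox file as XML."""
    root = ET.Element("mailbox")  # hypothetical element names throughout
    # create=False raises an error rather than creating a missing file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            record = ET.SubElement(root, "message")
            # Note whether the message has attachments or alternative parts.
            record.set("multipart", str(message.is_multipart()).lower())
            for header in ("From", "To", "Date", "Subject"):
                value = str(message.get(header, ""))
                ET.SubElement(record, header.lower()).text = value
    finally:
        box.close()
    return ET.tostring(root, encoding="unicode")
```

This sketch captures headers only. Preserving message bodies and attachments faithfully is the harder part of the job, which is why purpose-built parsers are worth using.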
There are many tools available to migrate other file formats, such as images, documents, audio and video. However, migration applications can potentially cause many problems, because they do not always effect the conversion in a way that preserves the most significant properties of an object or group of objects. That is where the XCL tools come in: critically, the characterization and comparison engine will allow a repository to compare the significant properties of files before and after conversion from one format (such as Microsoft’s .doc format) to another (such as PDF/A).7 Using such tools, you can verify, for example, that the colors in a photograph were correctly converted.
Drawing the above services and tools together (as well as many others), PLATO is a very useful tool that can help any repository develop a preservation assessment and strategy for a specific set of electronic records that is being considered for deposit or has recently been accessioned.8 PLATO incorporates several other tools (such as the file identification tool DROID and a file ‘comparator’). These tools allow a repository to identify the specific types of files needing processing. PLATO will also recommend the use of specific tools or services to undertake preservation actions. It includes the XCL characterization tools mentioned above, so you can verify that files were migrated in a way that preserved their significant properties, and since PLATO is a service, you can use it via a web browser. The Java-based source software is under an open source license and can be installed locally.
Finally, much work is currently under way on tools to create what the OAIS reference model terms a ‘submission’ or ‘archival’ information package. Briefly, such tools will package digital objects and their associated metadata into a bundle that can be deposited into repository software or a filing system, such as those listed below. Unfortunately, many of the current ingest tools are linked to particular projects or require extensive customization. For example, Tufts University provides a prototype system, which works with some Fedora applications.9 Similarly, Harvard has developed a METS/Java toolkit, which can be used to wrap metadata in a form suitable for deposit to many repositories. Another highly developed and flexible application is SWORD, which can interface directly with many DSpace and Fedora-based repositories.10
Ultimately, these are very promising tools, but it should also be noted that implementing them would require significant IT support. Many of the archival storage options discussed below also include methods to ingest digital objects and their associated metadata. Long term, such tools offer a better option.
Archival Storage, Data Management, and Access
Theoretically, an archives could design a local, file-based repository for e-records. Such a system would, of course, require adequate backup and the use of file naming conventions to link files to their associated metadata. Thankfully, much better options exist. At the lowest possible level of complexity, a repository could work with IT to install a file-based repository manager/system, such as Apache Jackrabbit. However, Jackrabbit may require the use of a separate description and access system.
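For readers curious what a bare-bones, file-based approach might look like, here is a minimal Python sketch. It assumes a naming convention of my own invention, in which each record is accompanied by a ‘sidecar’ file with the same name plus `.metadata.json`, and it stores a checksum so that later damage or alteration can be detected. It does not handle backup, access, or description, which is exactly why the better options discussed here are preferable.

```python
import hashlib
import json
from pathlib import Path

SIDECAR_SUFFIX = ".metadata.json"  # hypothetical naming convention


def sha256_of(path):
    """Compute a SHA-256 checksum, reading the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(path, description):
    """Store a description and checksum next to the file they describe."""
    path = Path(path)
    sidecar = path.with_name(path.name + SIDECAR_SUFFIX)
    record = {
        "file": path.name,
        "description": description,
        "sha256": sha256_of(path),
    }
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar


def verify_store(folder):
    """Return names of files that are missing or whose checksum changed."""
    problems = []
    for sidecar in sorted(Path(folder).rglob("*" + SIDECAR_SUFFIX)):
        record = json.loads(sidecar.read_text(encoding="utf-8"))
        target = sidecar.with_name(record["file"])
        if not target.is_file() or sha256_of(target) != record["sha256"]:
            problems.append(record["file"])
    return problems
```

Running `verify_store` on a schedule gives a simple fixity check: an empty list means every described file is present and unchanged since its sidecar was written.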
One promising application ties archival storage, data management, and access into a convenient package. RODA is a Portuguese project that developed a repository application specifically meeting the needs of archivists.11 It is a ‘one-stop shopping’ approach to storing and managing electronic records and their associated descriptive, preservation, and structural metadata. Repositories can also track preservation actions undertaken to reformat or migrate materials. Since it is based on Fedora, it is also extensible. Implementing the open-source RODA software would require some IT support, at least initially. Other such projects, such as Archivematica, also exist, and they are discussed more fully on my website.12
One of the most important things in life is hope. Electronic records are no different, and I hope this short overview has given you some optimism that you too can indeed tame the e-records beast. Over the next several years, we’ll continue to use this space to explore, in much more detail, ways to do just that.
1. To be a bit more precise, I believe it is possible for any archives to design and implement a program that puts into practice the principles and requirements of the OAIS Reference Model. The best short introduction is Brian F. Lavoie, “The Open Archival Information System Reference Model: Introductory Guide.” DPC Technology Watch Series, Report 04-0: January 2004. http://www.dpconline.org/docs/lavoie_OAIS.pdf
2. I am deliberately leaving out the OAIS functions of Preservation Planning, Administration, and Dissemination from the discussion. Engaging in these activities is critical to the success of any program, and the specific tools should be used in the context of an overall preservation strategy as well as procedures to implement that strategy. As my project progresses, I hope to offer templates to be used to assist in developing the planning and administration program, under the Recommendations tab at the project website, http://e-records.chrisprom.com. Regarding dissemination, many of the storage tools under development also include packages to disseminate and make useful materials included in the storage repository.
3. http://dca.tufts.edu/?pid=136. Accessed November 11, 2009
4. http://meta-extractor.sourceforge.net/. Accessed November 11, 2009.
6. See http://siarchives.si.edu/cerp/progress.htm and http://www.records.ncdcr.gov/EmailPreservation/default.htm for more information and tool downloads. Fookes Software, the publisher of Aid4Mail, plans to include a powerful scripting engine in version 2 of the software. It should obviate the need for a separate XML conversion engine, since users will be able to completely control export formats. It should be available by the time this column is published.
8. http://www.ifs.tuwien.ac.at/dp/plato/intro.html. Accessed November 11, 2009.
12. See http://e-records.chrisprom.com/?page_id=401 and http://e-records.chrisprom.com/?page_id=198