Email Preservation Options

On November 17, 2011, in Research, by Chris Prom

I have not been able to post in quite a while, since I’ve been wrapped up with other duties. But, to break the silence, here is a brief piece outlining some basic email preservation options, which I recently wrote for publication in an upcoming edition of the Midwest Archives Conference Newsletter:


The prominent Atlantic journalist and blogger James Fallows recently described how an email hacker destroyed records having great personal value: his wife’s entire Gmail archives, covering many years of her life.[1]  Although Fallows’ story ended happily, with the records being recovered through insider connections at Google, it seems likely that, for the population generally, little email correspondence is currently being saved and preserved for its historical value.

In a follow-up piece, Fallows noted how the email records of a prominent journalist, records likely of great historical value, were similarly lost.[2] At my own institution, one important university officer recently lost all email prior to 2010, apparently during a system migration.  An important scholar with whom I’ve been in contact related a very similar story.   The evidence I cite is anecdotal, but how many of our institutions are actually capturing records from email communications?

We in the archival community can and must help people save email in a way that makes it likely that email records will one day become research collections, openly accessible for their historical value.  In order to do this, each institution will need to develop its own rationale for a long-term email preservation project in light of local needs, institutional profiles, mandates, and policies.  Without denying the paramount importance of defining these policies, this article will provide some technical options that might be used to provide the building blocks for a set of local email preservation services, under the rubric of the two general approaches that are practicable using currently available technologies.

Option One:  The Whole Account Approach.  Institutions wishing to use this approach would capture email found on a user’s computer or account, working directly with the individual or his/her heirs.  In practical terms, many institutions that have not previously worked with email preservation may wish to begin with this method.

In many ways, it reflects the traditional archival model of capturing records at the end of a lifecycle, then taking archival custody over them.  Archivists may pursue this approach in several ways.  They might work in conjunction with IT staff to get a copy of the email in its native format, or they might deploy email capture software directly with email users.  Once the native files have been secured, they can be migrated using tools such as those discussed below.  Once migrated, they should be stored in a trusted digital repository, applying appropriate preservation practices and including descriptive and preservation metadata. Oxford University has been successfully using this approach, and we can all learn many lessons from their experience.[3]

Once you are comfortable accepting and accessioning entire email accounts, you will be in a good position to offer guidance and assistance to email users, helping them to ensure that critical records are retained in system-neutral formats until they are ready to be donated to an archives.  At that point, you can apply email migration software to capture and preserve records.  Several tools can facilitate the use of this ‘whole account’ approach:

Adobe Acrobat Pro: General office applications, such as Adobe Acrobat, may play a limited role in email preservation projects. When Acrobat Professional has been installed on a local workstation that also includes Microsoft Outlook, a menu item will be added to Outlook, allowing users to save individual messages or groups of messages to a PDF file or PDF portfolio.  However, messages saved in these formats will see an extreme loss of fidelity.  In particular, portions of the header will be excluded, and even if attachments are encoded correctly in the PDF file, this roundabout method of preservation poses extreme risk.

Mailstore Home: This application, which is free for non-commercial use, provides private individuals a method to back up all of their email accounts to a local computer or external drive, storing content in a proprietary format while allowing export to system-neutral formats.  Several paid versions of Mailstore are also available for commercial or educational use.

EmailChemy: A paid application, EmailChemy can convert many proprietary and open mail formats stored as local files to open-format targets, such as EML and MBOX files. It also includes a built-in email server and can migrate converted messages to another IMAP-compliant server.

Aid4Mail: A Windows-based desktop application, Aid4Mail can convert many mail formats to a wide range of open and proprietary formats. It can also connect to email servers (such as Gmail and Microsoft Exchange) to directly harvest email. It includes a filtering system to exclude or include messages meeting stated criteria in the exported set, a scripting language to allow for custom export formats, and the ability to save emails directly to PDF/A format.

CERP Email Parser: A web application that runs in an open source virtual machine (Smalltalk Squeak), the CERP Email Parser was developed by the Collaborative Electronic Records Project.  The tool will transform one or more MBOX files into a single XML file holding the contents of an entire email account, complying with the requirements of the XML Account Schema Format, which was jointly developed by the North Carolina State Archives and Smithsonian Institution Archives.
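To make the idea of migrating between open formats concrete, here is a minimal sketch, assuming Python and its standard-library mailbox module, that splits one MBOX file into individual EML files. It is not how any of the tools above work internally; the function name mbox_to_eml and the output file naming are hypothetical choices made for illustration.

```python
import mailbox
from pathlib import Path


def mbox_to_eml(mbox_path, out_dir):
    """Write each message in an MBOX file to its own .eml file.

    Hypothetical helper for illustration. Returns the paths written,
    in mailbox order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for index, key in enumerate(box.keys(), start=1):
            # get_bytes() returns the stored message without the
            # MBOX "From " separator line that precedes it in the file.
            raw = box.get_bytes(key)
            target = out / f"message-{index:06d}.eml"
            target.write_bytes(raw)
            written.append(target)
    finally:
        box.close()
    return written
```

A sketch like this copies the message bytes as stored, so any MBOX-specific escaping inside message bodies is carried over unchanged; a production migration would also need to record checksums and preservation metadata.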

Option Two: The Whole System Approach.  This approach refers to implementing so-called “email archiving” software to capture an entire email ecosystem, or a portion of that ecosystem, to an external storage environment. Optionally, rules specifying retention periods can be applied at the time of capture or disposition, based on sender, recipient, date sent, keyword, or classification. Ideally, records will be written in a system-neutral format, allowing for the integration of records into a trusted digital repository.
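As a rough illustration of how rules of this kind might be expressed, the following Python sketch matches a message against sender, recipient, and subject-keyword criteria and computes the date after which it could be disposed of. The RetentionRule class, the disposal_date function, and the first-match-wins ordering are all assumptions made for this example, not features of any particular archiving product; classification-based rules are left out.

```python
from dataclasses import dataclass
from datetime import timedelta
from email.utils import getaddresses, parsedate_to_datetime
from typing import Optional


@dataclass
class RetentionRule:
    """Hypothetical rule: every criterion that is set must match."""
    retain_days: int
    sender: Optional[str] = None
    recipient: Optional[str] = None
    keyword: Optional[str] = None  # matched against the Subject header only

    def matches(self, msg):
        senders = {a.lower() for _, a in getaddresses(msg.get_all("From", []))}
        to_cc = msg.get_all("To", []) + msg.get_all("Cc", [])
        recipients = {a.lower() for _, a in getaddresses(to_cc)}
        subject = str(msg.get("Subject", "")).lower()
        if self.sender and self.sender.lower() not in senders:
            return False
        if self.recipient and self.recipient.lower() not in recipients:
            return False
        if self.keyword and self.keyword.lower() not in subject:
            return False
        return True


def disposal_date(msg, rules, default_days):
    """Return the sent date plus the retention period of the first matching rule."""
    sent = parsedate_to_datetime(msg["Date"])
    for rule in rules:
        if rule.matches(msg):
            return sent + timedelta(days=rule.retain_days)
    # No rule matched, so fall back to the default retention period.
    return sent + timedelta(days=default_days)
```

Here msg is expected to be a message object parsed with Python's standard email package; which rule should win when several match is a local policy decision, and the first-match ordering is only one possibility.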

The Social and Public Health Sciences Unit (SPHSU) in the United Kingdom’s Medical Research Council successfully uses this approach for medium-term preservation, while leaving open the possibility of long-term preservation.  Since 2007, the Unit has been using a dual-license program, MailArchiva, to mirror a copy of every sent and received message for the approximately 120 accounts managed by their Qmail server. The messages are written in EML format to an external store, located on a separate physical machine. MailArchiva keeps an index of the messages and generates a web-accessible discovery site, which includes filter and search features and which is integrated with existing authentication services. Using this interface, staff can view messages and optionally save them in EML format outside the system, from where they can be restored to the account or manipulated in other software.
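The general pattern described here, writing each message to an external store as an EML file and keeping a searchable index alongside it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions and does not reflect MailArchiva's actual design: the function names, the SQLite index, and the use of a SHA-256 digest as the file name are all hypothetical choices.

```python
import hashlib
import sqlite3
from email import message_from_bytes
from pathlib import Path


def open_index(db_path):
    """Open (or create) a small SQLite index of archived messages."""
    conn = sqlite3.connect(str(db_path))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "digest TEXT PRIMARY KEY, sender TEXT, subject TEXT, sent TEXT, path TEXT)"
    )
    return conn


def archive_message(raw, store_dir, conn):
    """Store one raw message as an .eml file and record it in the index."""
    digest = hashlib.sha256(raw).hexdigest()
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    target = store / f"{digest}.eml"
    if not target.exists():  # identical bytes are stored only once
        target.write_bytes(raw)
    msg = message_from_bytes(raw)
    conn.execute(
        "INSERT OR IGNORE INTO messages VALUES (?, ?, ?, ?, ?)",
        (digest, str(msg.get("From", "")), str(msg.get("Subject", "")),
         str(msg.get("Date", "")), str(target)),
    )
    conn.commit()
    return target


def search_subject(conn, term):
    """Return the stored paths of messages whose subject contains term."""
    rows = conn.execute(
        "SELECT path FROM messages WHERE subject LIKE ? ORDER BY path",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

Naming files by a digest of their content makes duplicate captures harmless and gives a simple fixity check, at the cost of file names that mean nothing to a human reader; the index is what makes the store browsable.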

The institution chose to implement this software for several reasons. By 2007, it was apparent that the volume of email on the sending/receiving server had outstripped available resources. With quotas in place, many users were writing email to local computers, losing important messages, or asking for restores from tape backup, even several years after they had deleted messages. While IT staff could accommodate most requests, they felt burdened by an inefficient storage and retrieval process, and an IT advisory committee agreed to consider other options.

The system was put in place with a few policy guidelines, which have been incorporated into the general IT policy that all employees are provided upon hire. These policies simply state that each employee has a 2.5 GB limit on their personal account and that all sent and received messages will be captured to an external archives, which can be accessed at any time via a web browser. In addition, an employee’s supervisors are provided access to the account, and employees are told the software will continue to mirror their account for at least six months after they leave employment, after which their Qmail account will be deleted. Employees can export messages from their archive at any time if they desire a personal copy. If necessary, system administrators can export a large volume of messages in EML or other formats, for import to other systems.

In short, email archiving software provides an institution with the ability to put a policy of medium-term preservation into action. Although MailArchiva is one option, offering both open source and enterprise licensing options, institutions may also wish to pursue other tools, such as Symantec Enterprise Vault, which includes tools that integrate directly with enterprise email servers, such as Microsoft Exchange.

By using emergent tools and services to put time-tested archival concepts into practice, we in the archival community can provide the essential service of email preservation to our organizations and to individuals.  While the tools discussed do not solve the human problem (convincing a donor to trust you with their email), they do provide a foundation on which the personal relationships and policies can be built.

 


[1] James Fallows, “Hacked!,” The Atlantic, November 2011, http://www.theatlantic.com/magazine/archive/2011/10/hacked/8673/# .

[2] James Fallows, “Today’s Email Real-Life Scare Story,” The Atlantic Blogs, November 2, 2011, http://www.theatlantic.com/technology/archive/2011/11/todays-email-real-life-scare-story/247791/.

[3] Chris Prom, “Receiving and Managing Email Archives at the Bodleian Library: A Case Study,” Practical E-Records, August 12, 2011, http://e-records.chrisprom.com/?p=2200.
