Last week, I had the opportunity to interview Crawford Neilson regarding the email archiving routine used by the Social and Public Health Sciences Unit (SPHSU) of the Medical Research Council (MRC).  The conversation revealed a fascinating approach to preserving a complete record of all email transactions in an organization, where the focus is on ensuring current access and potential long-term preservation.

The SPHSU promotes human health by conducting research concerning the effects of environmental and social factors on people’s physical and mental well-being.  Based at the University of Glasgow, the unit employs about 100 people directly and collaborates with another 20 or so at any given time.  Unit staff maintain external research partnerships with others in the academic community and the National Health Service.  As members of the MRC and their research partners conduct research, aggregate information, analyze datasets, and write reports, they send and receive documentation concerning the research process via email.

Since 2007, the Unit has been using a dual-license program, MailArchiva, to mirror a copy of every sent and received message for the approximately 120 accounts managed by their qmail message transfer agent.  While the SPHSU is using the open source/community edition, Stimulus Software also sells an enterprise edition at moderate cost, and provides detailed pricing information through their website.

MailArchiva writes messages in .eml format to an external store, located on a different machine from the sending/receiving server.  It keeps an index of the messages and generates a web-accessible discovery site, which includes filter and search features and is integrated with existing authentication services.  Using this interface, staff can view messages and (optionally) save them in .eml format outside the system, from where they can be restored to the account or manipulated in other software.
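
To illustrate what "manipulated in other software" can look like, here is a minimal sketch that reads an exported .eml message using only the Python standard library.  The sample message, the addresses, and the function name `summarize_eml` are invented for illustration; nothing here is taken from the SPHSU's archive or from MailArchiva itself.

```python
from email import message_from_bytes, policy

# A made-up message in .eml form (plain RFC 2822 headers plus a MIME body).
SAMPLE_EML = (
    b"From: alice@example.org\r\n"
    b"To: bob@example.org\r\n"
    b"Subject: Dataset update\r\n"
    b"Date: Mon, 01 Mar 2010 09:30:00 +0000\r\n"
    b"Message-ID: <1234@example.org>\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"The revised tables are on the shared drive.\r\n"
)


def summarize_eml(raw: bytes) -> dict:
    """Parse raw .eml bytes and return a few headers plus the plain-text body."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": str(msg["From"]),
        "to": str(msg["To"]),
        "subject": str(msg["Subject"]),
        "date": str(msg["Date"]),
        "text": body.get_content().strip() if body is not None else "",
    }


if __name__ == "__main__":
    for key, value in summarize_eml(SAMPLE_EML).items():
        print(f"{key}: {value}")
```

For a file saved from the web interface, the same function could be applied to the bytes read from disk, for example `summarize_eml(open("message.eml", "rb").read())`, where the file name is again hypothetical.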

The institution chose to implement this software for several reasons.  By 2007, it was apparent that the volume of email on the sending/receiving server had outstripped available resources.  With quotas in place, many users were writing email to local drives, losing important messages, or asking for restores from tape backup, even several years after messages had been deleted.  While most requests could be accommodated, IT staff were burdened with an inefficient storage and retrieval process.  Against this background, the IT system manager, Crawford Neilson, researched available options and consulted with his IT strategies committee, made up of unit researchers and other stakeholders.  Since the unit director had an ongoing interest in the topic of digital preservation, and since members of the committee expressed a strong preference for treating email as a record of their research activities, it seemed reasonable to create an archive of the entire set of messages sent or received.

From Mr. Neilson’s perspective, the system has proven successful.  It was easily installed alongside their existing qmail server and, since it runs on a different machine, it causes little additional load on the server running qmail.  While the software allows several configuration options, the MRC has chosen to capture messages every 15 minutes, using the fetchmail utility.  Over three and a half years, it has captured in excess of 800,000 individual messages and has solved the previous storage problems.  Since it uses single instance storage for all messages and compresses files, the total storage volume for all email traffic sent by the 120 users over three years is less than 30 gigabytes.
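
The storage savings come from two ideas that are easy to sketch: keep only one copy of each distinct message, and compress what is kept.  The following is an illustrative toy, not a description of how MailArchiva is built.  The class name `SingleInstanceStore`, the use of SHA-256 digests as keys, and gzip compression are all assumptions made for the example.

```python
import gzip
import hashlib


class SingleInstanceStore:
    """Toy in-memory store: one compressed copy per distinct message."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}      # digest -> compressed message
        self._refs: dict[str, list[str]] = {}   # mailbox -> digests it holds

    def add(self, mailbox: str, raw: bytes) -> str:
        """Record that `mailbox` holds `raw`; store the bytes only if new."""
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = gzip.compress(raw)
        self._refs.setdefault(mailbox, []).append(digest)
        return digest

    def get(self, digest: str) -> bytes:
        """Return the original, uncompressed message bytes."""
        return gzip.decompress(self._blobs[digest])

    def unique_messages(self) -> int:
        return len(self._blobs)

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


if __name__ == "__main__":
    store = SingleInstanceStore()
    memo = b"Subject: Unit meeting\r\n\r\n" + b"Agenda item. " * 200
    # The same memo lands in three mailboxes but is stored once.
    for mailbox in ("alice", "bob", "carol"):
        store.add(mailbox, memo)
    print(store.unique_messages(), store.stored_bytes(), len(memo) * 3)
```

In this toy, a memo copied to three people occupies the space of one compressed copy, which is the same effect that keeps the archive's footprint small when many messages go to several recipients within the unit.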

The system was put in place with a few policy guidelines, which have been incorporated into the general IT policy that all employees receive upon hire.  This policy simply states that each employee has a 2.5 GB limit on their personal account and that all sent and received messages will be captured to an external archive, which can be accessed at any time via a web browser.  In addition, an employee’s supervisors are provided access to the account, and employees are told the software will continue to mirror their account for at least six months after they leave employment, after which their qmail account will be deleted.  Employees can export messages from their archive at any time if they desire a personal copy.  If necessary, system administrators can export a large volume of messages in .eml or other formats, for import to other systems.

The software also includes the ability to set retention periods for particular types of records.  Currently the SPHSU retains the archives created by the MailArchiva software for both current and former employees, as a record of their activities.  At the time the system was implemented, the MRC planned to keep six years’ worth of email.  However, it is possible to save messages for longer periods, and this policy may be reassessed and the period extended in view of the overall storage efficiencies that have been achieved.  Presumably, the MRC could mark emails for very long-term or even permanent retention.  One can easily imagine an institution using software like this, then creating a complementary policy that would allow for the transfer of email from selected individuals to a repository that provided preservation for cultural or historical value.
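
A retention rule of this kind reduces to simple date arithmetic.  The sketch below shows one way a six-year rule might be evaluated; it is not MailArchiva's configuration syntax or logic.  The function names and the `permanent` flag (standing in for messages marked for permanent retention) are hypothetical.

```python
from datetime import date


def retention_deadline(received: date, years: int = 6) -> date:
    """Return the last day a message received on `received` must be kept."""
    try:
        return received.replace(year=received.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February.
        return received.replace(year=received.year + years, day=28)


def is_expired(received: date, today: date, years: int = 6,
               permanent: bool = False) -> bool:
    """True once the retention period has passed, unless kept permanently."""
    if permanent:
        return False
    return today > retention_deadline(received, years)


if __name__ == "__main__":
    received = date(2007, 6, 1)
    print(is_expired(received, date(2013, 6, 1)))   # still within the period
    print(is_expired(received, date(2013, 6, 2)))   # period has passed
    print(is_expired(received, date(2020, 1, 1), permanent=True))
```

Extending the period, as the MRC may do, would amount to changing the `years` value; selecting individuals for transfer to a long-term repository corresponds loosely to the `permanent` flag.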

In short, the MRC is pursuing a policy of medium-term preservation and is doing so quite effectively.  The software and hardware they are using store messages in an open format that is based on the RFC 2822 and MIME standards.  The software also provides a reasonably efficient search and discovery system.  The email archive gives end users a way to retrieve their own deleted messages, allows the institution to respond to potential legal and audit requests, and leaves the files in a preservation-ready format.

