Last week, I had an interesting lunchtime conversation with Geoff Barton, who directs the bioinformatics group at the University of Dundee’s College of Life Sciences.  Going into the conversation, I had hoped that it might prove possible to work with his group to identify one or more datasets and/or applications suitable for inclusion in a pilot deposit project for an ARMMS e-records repository.  In the end, that did not prove as feasible as I had hoped, but in the process I gained some insight into the particular challenges of working with the electronic ‘papers’ of faculty members.

I was not particularly surprised to learn that the amount of data needing preservation in the scientific community is immense.  Barton told me about his own projects, as well as several international projects related to genome sequencing and protein analysis.  In some cases, data is distributed both for redundancy and because there is simply too much of it to store in one physical location.  Petabytes are generated annually, and even allowing for projected improvements in storage efficiency, many projects will run out of physical space to mount storage disks within three to five years.  The power requirements can also be immense, on the order of millions of euros per year.  The scientific community has a vested interest in maintaining much of this data: it is constantly revised and supplemented, and it forms a research corpus for future work.  An expensive and no doubt important project funded by the National Science Foundation will be assessing ways that libraries can assist in curating large-scale and dynamic data archives.

As my conversation with Dr. Barton progressed, it became apparent to me that there is probably little that an average archivist can or should do on his or her own with such data, other than making very generic recommendations concerning backup, preserving context, etc.  Other groups have a strong interest in preserving such material, and a non-custodial model seems best.  Similarly, libraries and publishers have a strong interest in ensuring access to the published output of research.

But what about records that might illustrate the motivations for a research project, the working out of theories, the politics of research, or the genesis of ideas?  Who is preserving such records for the modern generation of faculty?  I mentioned to Dr. Barton that records other than datasets and publications have traditionally been the most fecund sources of archival research.1  In response, he noted that he conducts almost all of his work via email, and he feels it offers the best record of his activities, interests, and work.  (It would be interesting to know whether this is also the case for other scientists.)

Unlike most casual computer users, he has been very deliberate in preserving his email.  He keeps it in the generic and easily migrated mbox format, under his own control.  It resides on an IMAP server with off-site backup, which he maintains and to which only he has access.  His log stretches back over 20 years.
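
(As a side note for anyone weighing a similar format decision: part of mbox’s appeal is that it requires no proprietary software at all.  The sketch below, which uses a hypothetical file name, reads an mbox file with nothing but Python’s standard mailbox module.)

    import mailbox

    # Open a (hypothetical) mbox file. No proprietary tools are needed:
    # mbox stores messages as plain text, separated by "From " lines.
    box = mailbox.mbox("barton-archive.mbox")

    for message in box:
        # Standard RFC 822 headers are directly accessible.
        print(message["Date"], message["From"], message["Subject"])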

Few other people probably go to such lengths, but in my (admittedly limited) experience, other academics also like to keep voluminous email logs as a personal ‘archive’ for reference purposes.  For example, the email record of Dr. Paul Lauterbur, which I have been working to transform, covers about 20 years.  It had been left on his obsolete Apple Mac, in Eudora’s peculiar email format.  So, at least in some cases, it seems likely that such records will continue to accumulate, and archivists will need tools to manage them (the best current option for salvaging and converting email to a common format is the excellent and affordable Aid4Mail program).
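
To give a flavor of what such a migration involves, here is a rough sketch that copies messages from one mailbox store into a clean mbox file using only Python’s standard library.  This is not how Aid4Mail works internally; it assumes the source file is close enough to mbox to parse (Eudora files often are not, without pre-cleaning of attachment stubs and separators), and both file names are hypothetical.

    import mailbox

    # Eudora mailboxes are roughly mbox-shaped, so a first pass can try
    # reading the old file as mbox and re-writing it in clean mbox form.
    source = mailbox.mbox("In.mbx", create=False)
    target = mailbox.mbox("lauterbur-converted.mbox")

    target.lock()
    try:
        for message in source:
            target.add(message)  # re-serialize each message
        target.flush()
    finally:
        target.unlock()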

The question of what will happen to such records over the long term is of critical importance.  Dr. Barton, for instance, noted that he would be unwilling to donate his email to an archives until after his death.  Unless a prior deposit arrangement is made, specifying procedures for accessing the server, it will likely be difficult if not impossible to access his email at that time.

Existing case studies illustrate the difficulties of processing and preserving the authenticity of electronic records deposited without any pre-custodial intervention, even when the files involved are much less complicated than email.  While it is possible to process and provide access to such records, one has to wonder whether the procedures described in these articles are really scalable, or achievable at institutions of less stature (and funding) than the Ransom Center or the Beinecke Library.2

All of which is just to say that the conversation has made me think very carefully about the need to work actively with donors to specify the scope of a submission agreement, long before records are actually scheduled for transfer or deposit.  The next template I intend to supply is a draft submission agreement/deed of gift suitable for both immediate and delayed transfers, so that, hopefully, records can be turned over to us in an easier-to-process state.  Ideally, of course, such an agreement would be provided in a processable XML format, but since we are living in the real world, we’ll have to make do with a Word template first.
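
Just to make the XML idea concrete, here is a minimal sketch of what generating such an agreement might look like in Python.  Every element name below is my own invention for illustration, not an established schema.

    import xml.etree.ElementTree as ET

    # Hypothetical element names -- not an established schema.
    agreement = ET.Element("submissionAgreement")
    ET.SubElement(agreement, "donor").text = "Jane Q. Faculty"
    ET.SubElement(agreement, "materials").text = "Email, ca. 1990-2010, mbox format"
    ET.SubElement(agreement, "transferType").text = "delayed"  # or "immediate"
    ET.SubElement(agreement, "accessRestrictions").text = "Closed until donor's death"

    ET.ElementTree(agreement).write("agreement.xml", encoding="utf-8",
                                    xml_declaration=True)

An agreement recorded this way could be validated and acted on by repository software, rather than sitting in a filing cabinet; the Word template is simply the pragmatic first step.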


1. Thinking about the analog manuscript collections with which I am familiar (such as the papers of the physicist John Bardeen, the anthropologist Oscar Lewis, and the IOC president Avery Brundage), the correspondence has been the richest and most heavily mined part of each collection.

2. Sarah Kim, Lorraine A. Dong, and Megan Durden, “Automated Batch Archival Processing: Preserving Arnold Wesker’s Digital Manuscripts,” Archival Issues 30:2 (2006): 91-106; Michael Forstrom, “Managing Electronic Records in Manuscript Collections: A Case Study from the Beinecke Rare Book and Manuscript Library,” American Archivist 72:2 (Fall/Winter 2009): 460-77.
