Guest Post: Ben Goldman

On June 23, 2011, in Research, by Chris Prom

Guest posting today is Ben Goldman, Digital Programs Archivist at the American Heritage Center. His work on accessioning born-digital archives makes extensive use of the Duke Data Accessioner and represents an excellent first step toward dealing with legacy digital media. What Ben calls a ‘baseline set of requirements for a very humble electronic accessioning process’ is an implementation similar in concept to the steps I outlined in the recommendations section of this blog.

Using What Works: A Practical Approach to Accessioning Born-Digital Archives
by Ben Goldman, Digital Programs Archivist, American Heritage Center.

At the American Heritage Center we have established a remedial process for accessioning born-digital material, a process informed and constrained by the particular born-digital material we’ve acquired (mostly disks), the limited resources and technical infrastructure at our disposal, and even the time I have to dedicate to this issue (which is, officially, 20% of my time). These limitations are realities that have so far stunted our ability to manage the born-digital material we’ve acquired.

There is ample evidence to suggest that our situation is quite common among manuscript repositories. The recently published “Taking Our Pulse: The OCLC Research Survey of Special Collections and Archives” found that 79% of 169 respondents acknowledged the existence of born-digital material in their collections, a percentage that stood, as the authors note, “in stark contrast to the 35% who reported the size of their born-digital holdings.” It also stood in contrast to the 45% who weren’t sure who was even responsible for born-digital material in their repository. In summing up the findings prior to the release of the report, one of the authors called born-digital material “undercollected, undercounted, undermanaged, and inaccessible.”

What does it mean for an institution to have acquired born-digital material, but not be able to quantify it, manage it, or access it? At the American Heritage Center, it means most of the born-digital material we’ve acquired has arrived on floppy disks, zip disks, CDs and DVDs, which we deposited in boxes and filed away in the stacks, and have largely ignored.

As far as our internal administration was concerned, these disks were already accessioned, usually as part of much larger, mostly paper-based collections and following protocols established for analog collections. But this only makes sense logically if you consider disks—or digital media of any sort—to be items in collections, deserving of the same consideration we might give to individual documents. It is more appropriate, I submit, to think of digital media as containers of items which require the kind of archival administration we might normally reserve for boxes in a collection. In this sense, the data (files and folders) found in these containers had not been accessioned at all. In general, I think that so many archives not being able to count, manage, or access born-digital material is an indication that we have not been taking adequate steps to accession it.

SAA’s Glossary of Archival Terminology defines accessioning as taking “legal and physical custody of a group of records or other materials and to formally document their receipt.” It goes on to note that accessioning includes “the initial steps of processing by establishing rudimentary physical and intellectual control over the materials.” The SAA Glossary then provides a definition from the monograph, Keeping Archives, which says: “This initial process is called accessioning which records information about origins, creator, contents, format and extent in such a way that documents cannot become intermingled with other materials held by the archives.”

At the AHC we document legal and physical custody through forms and agreements. Gaining physical control includes briefly surveying the material for any immediate preservation concerns, like mold or pests. Establishing intellectual control includes identifying and documenting provenance, creators, contents, and formats at a very high level, which we detail through paperwork and in our archival management system. We also estimate how much material a collection contains (in cubic feet), and all of this information is entered into very basic, collection-level EAD finding aids, which are to be expanded upon later by our processing archivists.

Looking at both the profession’s definition of accessioning and our own internal application of accessioning steps, it’s fairly evident to me that little we at the AHC had done with electronic records actually qualified as accessioning. We hadn’t surveyed disks for viruses, which we might think of as the digital equivalent of mold or pests. Steps to gain intellectual control had not been taken. We hadn’t attempted to identify formats or estimate the size or extent of digital material. We had no descriptive information about the files on disks, which sometimes number in the tens of thousands, and in many cases, we hadn’t even properly documented the existence of digital media in collections.

There are, however, many professionals outside of the archival tradition working to preserve electronic records, and many of them may not even be familiar with the term accessioning at all. In other arenas the OAIS Model is the framework by which procedures for acquisition, management, and preservation of born-digital material are developed, and ingest is the term used to describe the receipt and storage of an archived document.

There are obvious similarities between the two concepts that make it easy to draw a connection between them. Brian Lavoie describes the specific functions covered by the Ingest component of the model as:

Receipt of information transferred to the [archival system] by a Producer; validation that the information received is uncorrupted and complete; transformation of the submitted information into a form suitable for storage and management within the archival system; extraction and/or creation of descriptive metadata to support the [archival system]’s search and retrieval tools and finding aids; and transfer of the submitted information and its associated metadata to the archival store.

Examining what ingest and accessioning have in common, I believe we can identify a baseline set of requirements for a very humble electronic accessioning process. Both cover much wider ground than just taking physical custody of material; both definitions also include elements of appraisal, arrangement, and description. And perhaps most importantly, both definitions seem to understand that accessioning is an important archival step not taken in isolation from other archival functions, that, in fact, it lays the groundwork for preservation, arrangement, description, and even access procedures.

  • OAIS says we must validate that information received is uncorrupted, while the archival notion of physical custody often includes some form of preservation check during the accessioning phase. This leads me to conclude that virus-checking should be a required component of electronic accessioning.

  • OAIS says that we should be extracting or acquiring descriptive metadata, while SAA identifies the capture of descriptive information as essential to gaining intellectual control over the material. This leads me to conclude that as part of an accessioning program, we should seek basic information about the files—the file names, creation dates, and authors, if possible—in addition to information about the order of the files, that is, the directory structure of the disk, at the time of accession. All of this information will support eventual arrangement and description activities.

  • OAIS says the ingested files should be transformed into a format suitable for storage and management, while the SAA definition mentions that part of gaining intellectual control over a collection involves defining formats. This leads me to conclude that performing file format validation during the accessioning process is essential, and will support eventual preservation and access activities later.

What neither definition mentions is authenticity, which InterPARES defines as a record being what it purports to be, free from tampering or corruption. The authenticity of analog collections has traditionally been supported through the documentation we generate during the acquisition and accessioning of collection material. But digital records are inherently more fragile and susceptible to different forms of corruption, and we need to enhance our accessioning procedures a bit to meet this evolution.

The InterPARES Set B Benchmark Requirements for Authenticity suggest, loosely, that archivists should take steps both to document that files have not been corrupted since arriving at the archive and to document all archival activity associated with those files, including activities such as migration or normalization.

Currently, one standard practice for verifying authenticity is to use checksums. Very simply, a checksum is a digital fingerprint in the form of a string of characters that can be generated for any digital file. If the file changes in any way, whether through tampering or simple bit rot, the checksum generated for it will also change, indicating the alteration. Capturing a checksum for every file accessioned, then, is another step to add to this developing list of accessioning requirements.
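As a rough illustration only (this is not part of the AHC workflow itself, and the choice of MD5 is just an example algorithm), a checksum can be generated with a few lines of Python:

    import hashlib

    def file_checksum(path, algorithm="md5", chunk_size=65536):
        """Compute a checksum for one file, reading it in chunks so large files fit in memory."""
        digest = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Recording this value at the time of accession lets us detect later alteration:
    # if the checksum no longer matches the stored value, the file has changed,
    # whether through tampering, bit rot, or an undocumented migration.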

While Set B identifies the need to document relationships between original files and any copies made for access purposes, it also demands that we document any changes or work done on files over time, a requirement echoed in other literature on digital preservation. While this is not strictly an accessioning concern, I think this documentation needs to begin the moment we accession a digital file, if not the moment we acquire it. So the final requirement I’ve identified for the AHC accessioning workflow is to begin this documentation and file it digitally alongside the accessioned material.
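What that documentation looks like will vary; as a minimal sketch, assuming nothing more than a plain-text log stored alongside the accessioned files (the file name, directory, and function here are hypothetical, not the AHC’s actual practice), it could be as simple as:

    import datetime
    from pathlib import Path

    def log_action(accession_dir, action, agent, note=""):
        """Append a dated entry to a plain-text preservation log kept with the accessioned material."""
        log_path = Path(accession_dir) / "preservation-log.txt"
        timestamp = datetime.datetime.now().isoformat(timespec="seconds")
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(f"{timestamp}\t{agent}\t{action}\t{note}\n")

    # The first entry documents the accession itself; later entries would record
    # migrations, normalizations, or any other preservation actions taken on the files.
    # Example (hypothetical path): log_action("/storage/masters/acc2011-042", "accessioned disk contents", "B. Goldman")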

The final tally of requirements for accessioning digital files from disks is:

  • perform a virus-check
  • capture descriptive metadata about the files and folders on a disk
  • capture data about file formats
  • capture checksums for every file
  • begin documentation that records the management and preservation actions taken over time

Having established the theoretical foundation for our work, and identified a baseline set of requirements, the challenge becomes finding practical, manageable ways to meet these requirements. All of these steps can be done manually for each file, using a variety of very basic software, but this can be labor intensive and won’t scale well as the size of our born-digital holdings increases. I looked at a number of software options, but ended up choosing the DataAccessioner, developed by Seth Shaw, Electronic Records Archivist at Duke University. The DataAccessioner transfers, en masse, the entire contents of a disk to our networked storage. For each disk accessioned, it produces an XML-structured file that documents the full directory structure of the disk and a checksum value for every file, and it identifies file formats using DROID and JHOVE plug-ins.

So when I pop a disk into my computer, the first thing I do is run a virus check. Assuming the disk passes without issue, I’ll use the DataAccessioner to transfer the entire contents of the disk to our network storage, while at the same time capturing checksums, directory structures, file names, dates created/modified, and file formats in the associated XML file. While this is running, I’ll create a new document to track the management and preservation actions taken over time with files from this disk, with the first item on the list documenting the accession itself.
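For readers who want a concrete picture of what that capture step involves, here is a minimal Python sketch of the same idea. It is emphatically not the DataAccessioner itself: it uses only the standard library, writes a flat CSV manifest rather than the tool’s XML output, substitutes a crude MIME-type guess for DROID and JHOVE format identification, and assumes the virus check has already been run separately.

    import csv, hashlib, mimetypes, os, shutil
    from datetime import datetime, timezone

    def accession_disk(disk_path, dest_path, manifest_path):
        """Copy a disk's contents to storage and write a manifest recording each file's
        path, size, last-modified date, format guess, and checksum."""
        shutil.copytree(disk_path, dest_path)  # preserves the disk's directory structure
        with open(manifest_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(["path", "bytes", "modified", "format_guess", "md5"])
            for root, _dirs, files in os.walk(dest_path):
                for name in files:
                    full = os.path.join(root, name)
                    info = os.stat(full)
                    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat()
                    mime, _ = mimetypes.guess_type(name)  # crude stand-in for DROID/JHOVE identification
                    digest = hashlib.md5()
                    with open(full, "rb") as f:
                        for chunk in iter(lambda: f.read(65536), b""):
                            digest.update(chunk)
                    writer.writerow([os.path.relpath(full, dest_path), info.st_size,
                                     modified, mime or "unknown", digest.hexdigest()])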

And I actually do this twice for each disk. Our digital storage at the AHC is split across two virtualized file servers, one for “masters” and one for “access”, with both locations backed up remotely. The “master” copy is completely restricted, effectively a dark archive. To support eventual use by staff and researchers, while protecting the master from inadvertent alteration, a second copy—the “access” copy—is required. Using this process we’ve thus far accessioned over 150,000 files, comprising a little over 200 gigabytes of data.
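Because the checksums are captured at accession for both copies, it is straightforward to confirm later that the access copy still matches the master. A rough sketch, assuming each copy has a manifest like the hypothetical one above:

    import csv

    def compare_manifests(master_manifest, access_manifest):
        """Report files whose checksums differ between the master and access copies,
        and files present in the master manifest but missing from the access copy."""
        def load(path):
            with open(path, newline="", encoding="utf-8") as f:
                return {row["path"]: row["md5"] for row in csv.DictReader(f)}
        master, access = load(master_manifest), load(access_manifest)
        mismatched = [p for p in master if p in access and access[p] != master[p]]
        missing = [p for p in master if p not in access]
        return mismatched, missing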

I consider what I’ve outlined here today to be a baseline for beginning to address the accessioning of born-digital material, a workflow that can and should evolve over time, that can and should accommodate increasingly complex steps as needed, as the resources at our disposal increase, or as technology evolves. For better or worse, our culture is now largely digital, and technology will remain fluid. To keep up, we’ll need to remain agile with our thinking and our practices, and that’s what we intend to do at the AHC with the process I’ve outlined.

But however remedial, with these initial steps we have effectively made our institution one that can count, manage, and access its born-digital holdings, unlike the majority of institutions surveyed in the recent OCLC report. We now have the ability to determine the extent of our digital holdings, in bytes, and we have the digital files transferred to network storage, secure and redundant, and separated from more fragile media like floppy disks. With the information we capture at the time of accession, we’ve started to actively manage these files, started to gain some intellectual control, and should someone request access to a disk’s contents, we could provide it in our reading room within a matter of minutes.

© Copyright Ben Goldman, All rights reserved.
