As I’ve noted in the past, it is vitally important that file formats be correctly identified if we wish to preserve files for future access and use. The best tool to do this, DROID, relies on the PRONOM database.
For that reason, I was pleased to see that the National Archives (UK) just announced a major expansion of the database. This work should be immediately available to anyone using DROID, and I hope the results will also reach other tools that use DROID, such as the Duke Data Accessioner and Archivematica, once their signature files are updated.
I also noted that the UDFR project, which seeks to provide a long-term replacement to PRONOM and the GDFR project, has now received funding from the NDIIPP, and is moving into an active development phase. Details are a bit sparse, but I hope to hear more at the 6th International Digital Curation Conference, where I’ll be presenting later this week. It looks like there may be a live feed of the conference (or at least some interviews, tweets, etc.) at an idcc Netvibes site.
Over the past day, I have been testing tools for appraisal, using records from the American Library Association Office of Intellectual Freedom (OIF), the Freedom to Read Foundation (FTRF), and the Leroy J. Merritt Humanitarian Fund. The files are particularly appropriate for this purpose since they represent the complete functioning of related groups within a larger organization, since no prior appraisal has been conducted on the files, and since the files are likely to have continuing value to the organization, as well as future research value for students, scholars, and members of the public.
Under a research/nondisclosure agreement, I was supplied a snapshot of the office’s working files on July 28, 2009. Although the files were given to me for research purposes only, it is possible that the Office of Intellectual Freedom will decide to include some of the files in the American Library Association Archives at the end of the research project.
The files comprise a complete electronic record of the office since the time that office began storing files on a shared server. The folders use a deep file structure and include a wide range of file formats. In addition, some of the materials are sensitive and will need to either be removed from the archives or placed under a restriction policy. (This is particularly the case for Merritt Fund materials, which include case files.) For this reason, it is important that potentially private materials be identified and then segregated and removed from materials to be deposited, or placed under appropriate restriction policies, in agreement with the creating office.
Obviously, one needs a semi-automated way to identify potential files for inclusion. Such work could be completed either by an archivist or a records creator, but tools are needed to sort through these materials. As a result, I tested several approaches.
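A first pass at this kind of survey can be sketched in a few lines of Python. This is only an illustration, not one of the approaches tested here: the function name `tally_extensions` is hypothetical, and counting by file extension is a crude stand-in for real format identification, since extensions can be missing or wrong.

```python
import os
from collections import Counter
from pathlib import Path


def tally_extensions(root):
    """Walk a (possibly deep) folder tree and count files by extension.

    Hypothetical helper for illustration only. Extensions are lowercased;
    files without one are counted under "(none)".
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = Path(name).suffix.lower() or "(none)"
            counts[ext] += 1
    return counts


if __name__ == "__main__":
    # Print the most common extensions under the current directory.
    for ext, n in tally_extensions(".").most_common(10):
        print(f"{ext}\t{n}")
```

A tally like this says nothing about sensitivity or value, but it gives a quick picture of what a set of working files contains before any closer review.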
DROID, developed by the UK National Archives, is a tool that can also assist archivists in identifying file formats. It is sometimes used as part of processes to preserve electronic records. The FITS tools, for example, make use of it to extract information concerning the identity of the file type, and the proof of concept version of Archivematica stores some of the information that DROID extracts in the archival information package that it generates.
However, I think it may be equally valuable as part of an appraisal process, when an archivist is trying to understand the components of a particular series of records.
DROID reads internal header information from one or more files, then uses a sophisticated algorithm to compare that information to signature files stored in the PRONOM database. Based on the comparison, DROID declares whether a match is ‘positive,’ ‘tentative’ or ‘unidentified’. For each positive or tentative match, DROID provides the PRONOM Unique ID (PUID), MIME type, format, and version. The exact process that the software uses is described in the technical manuals for the system, but obviously the success of the process depends largely on the completeness of the database/signature file to which DROID refers.
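The general idea of signature matching can be shown with a minimal Python sketch. It is not DROID's actual algorithm, and it does not read PRONOM's signature file: the `SIGNATURES` and `EXTENSIONS` tables are tiny hand-written stand-ins, the function names are hypothetical, and treating an extension-only match as ‘tentative’ is a simplifying assumption.

```python
from pathlib import Path
from typing import Optional, Tuple

# Hand-written stand-in for a signature file: (leading bytes, format name).
# PRONOM's real signatures are far richer (offsets, wildcards, versions).
SIGNATURES = [
    (b"%PDF-", "Portable Document Format"),
    (b"\x89PNG\r\n\x1a\n", "Portable Network Graphics"),
    (b"GIF87a", "Graphics Interchange Format"),
    (b"GIF89a", "Graphics Interchange Format"),
]

# Assumed fallback table, used only when no byte signature matches.
EXTENSIONS = {
    ".pdf": "Portable Document Format",
    ".png": "Portable Network Graphics",
    ".gif": "Graphics Interchange Format",
}


def identify_bytes(header: bytes, filename: str = "") -> Tuple[str, Optional[str]]:
    """Return (status, format name) for the leading bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return ("positive", name)
    # Assumption: an extension-only match counts as 'tentative'.
    ext = Path(filename).suffix.lower()
    if ext in EXTENSIONS:
        return ("tentative", EXTENSIONS[ext])
    return ("unidentified", None)


def identify_file(path) -> Tuple[str, Optional[str]]:
    """Read the first few bytes of a file on disk and identify it."""
    p = Path(path)
    with p.open("rb") as handle:
        header = handle.read(16)
    return identify_bytes(header, p.name)
```

Even in this toy form, the dependence on the signature table is plain: a format that is missing from the table can never be identified, however good the matching logic is.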
The tool is very helpful, but I don’t think many people outside of large-scale digital preservation projects are actually using it, since it is something of a power tool and since its main purpose is to support preservation of digital objects in a repository. You can download versions of it for all major platforms from SourceForge; the stats provided seem to indicate that it has been downloaded around 8,000 times (version 4.0, 1,600 times).
Aside from its use for digital preservation, it can also be used when assessing files for potential accession. In the future, DROID (or an application like it) could be even more useful. When the UDFR proposal and resources such as the PLANETS Core Registry (PCR) come to fruition, particular file formats could be linked to lists of software that can render and/or undertake preservation actions for particular file types. The PLANETS tools, such as PLATO and the Testbed, when they are released in May, may include some of this expanded functionality.
In any case, my full ‘evaluation’ of DROID, which I used to ID my test records, is after the break.
As I review my notes for the Society of Archivists Conference, I’m struck by one paper in particular: that of Malcolm Todd. He reviewed the digital preservation advisory services that The National Archives (TNA) provides to the broader archives community in the UK. (As I’ve noted elsewhere, TNA takes a much more expansive role than NARA in providing services for professional archivists, including policy planning and tools development for the entire archives sector.) They are hoping to ramp up this activity in providing assistance to the broader UK community concerning electronic records and digital preservation planning and tools.
While many of the services and software that Mr. Todd reviewed were not new to me (e.g. DROID* and PRONOM), he provided a useful roadmap of activities that TNA is taking to transfer knowledge, including involvement in the “Digital Preservation Roadshows” that are co-sponsored by the Society of Archivists, TNA, and other organizations. He noted that there are plans to combine the work from TNA (PRONOM) and Harvard (JHOVE) in a combined Unified Digital Format Registry (UDFR).
There was much to chew on in his talk, but the most salient points I took away were these (just to be clear–these are my conclusions, not necessarily Malcolm’s):
- The digital curation and IT communities have far outpaced the archivists in developing tools to facilitate digital preservation work.
- Digital preservation is a solvable problem, but it is only a small part of what we need to be effective in working with e-records. (I know, this point is relatively facile and in any case is not new.)
- With a few notable exceptions, few practicing archivists with actual ‘line’ experience have been heavily involved with standards and tool development or even in testing the tools developed to facilitate electronic records work.
- It is imperative that line archivists become more heavily involved in technical projects. If we don’t do so, we will never influence the development of software, methods, and policies.
- There is way too much information for one person to read, assess, and assimilate, even if one limits oneself to one aspect of electronic records work, such as digital preservation.
As I’ve been reflecting on all this, I’ve also been reading a UNESCO-commissioned paper by Kevin Bradley from the National Library of Australia and his colleagues Junran Lei and Chris Blackall at the Australian Partnership for Sustainable Repositories (thanks to Peter van Gardener for the citation). The paper provides a useful review (circa 2007) of the state of play concerning digital repository software. It provides recommendations as to how UNESCO might assist in developing a low-cost repository system that can be used in nearly any context (including that of smaller archives and developing nations). In general the report is surprisingly upbeat and lays out a set of specific steps that could be taken to develop a low-cost repository system.
Both Malcolm’s talk and the Australian report leave me with a distinct sense of dread: archivists need to do much more to involve themselves in the nitty-gritty of systems design and workflow management. There are many projects and tools that might be used as part of an integrated workflow for electronic records, but there is precious little work being done to tie them together into a software suite that archivists could use without years of study, training, and experimentation.
For example, the Bradley paper I mentioned above notes that there are many tools to ingest and manage technical and preservation metadata for simple archival objects, but the report is silent on the issue of how descriptive metadata should be generated and/or managed in such a system (it seems to imply that each file/object will have its own descriptive record but doesn’t say how it should be created). Similarly, a tool like DROID or JHOVE might be useful as one small part of an electronic records workflow, since it is very useful to know what kind of file you are assessing or trying to preserve. But let’s not kid ourselves: identifying file formats is only a very small part of our work, though obviously it has implications for appraisal, arrangement, description, preservation and access.
Nevertheless, if we want to work effectively with electronic records, I think we can come close to cobbling together a set of tools from existing software. Admittedly, there are likely to be gaps. One or more key functional requirements for good archival practice (such as appraisal methods) will be unmet, at least in the short term. And we need to be careful that in picking and choosing from the smorgasbord of tools that others have created we do not electronically reincarnate the workflow and management issues that left us with staggering backlogs of paper files.
Let me be the first to admit that I have compiled a gigantic folder of raw ‘electronic records’ that I hope to appraise, arrange, describe, preserve and provide access to–at some future date. At the same time, we can only gain the expertise we need to influence system design if we use, evaluate, criticize (constructively) and refine existing products and services. (Only after we have done this might we consider developing new tools.)
Where am I going with this post? Simply here: my first few weeks thinking about electronic records have shown me how much I don’t know. They have also given me the idea for a feasible workplan for the next few months . . . more on that in my next post.
*Older versions of the DROID software and a description of the project are found here.