Using DROID for Appraisal

On February 17, 2010, in Research, Software Reviews, by Chris Prom

DROID, developed by the UK National Archives, is a tool that can also assist archivists in identifying file formats. It is sometimes used as part of processes to preserve electronic records. The FITS tools, for example, use it to extract information about a file's format, and the proof-of-concept version of Archivematica stores some of the information that DROID extracts in the Archival Information Package (AIP) that it generates.

However, I think it may be equally valuable as part of an appraisal process, when an archivist is trying to understand the components of a particular series of records.

DROID reads internal header information from one or more files, then uses a sophisticated algorithm to compare that information to signature files stored in the PRONOM database. Based on the comparison, DROID declares whether a match is ‘positive’, ‘tentative’ or ‘unidentified’. For each positive or tentative match, DROID provides the PRONOM Unique ID (PUID), MIME type, format, and version. The exact process that the software uses is described in the technical manuals for the system, but obviously the success of the process depends largely on the completeness of the database/signature file to which DROID refers.
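To make the idea concrete, here is a minimal Python sketch of signature-based identification. It is not DROID's algorithm: the three-entry signature table is hand-written for illustration (DROID reads its signatures from the PRONOM signature file), the function names are hypothetical, and treating an extension-only match as ‘tentative’ is my assumption about how the three statuses relate.

```python
# Illustrative only: a tiny, hand-written signature table. DROID itself reads
# its signatures from a PRONOM signature file and uses more elaborate matching.
SIGNATURES = [
    # (format name, byte sequence expected at offset 0, known extensions)
    ("PDF", b"%PDF-", {".pdf"}),
    ("PNG", b"\x89PNG\r\n\x1a\n", {".png"}),
    ("GIF", b"GIF8", {".gif"}),
]


def identify(filename, header):
    """Return (status, format name) for a file name and its leading bytes."""
    lowered = filename.lower()
    # Pass 1: an internal byte signature counts as a 'positive' match.
    for name, magic, _extensions in SIGNATURES:
        if header.startswith(magic):
            return ("positive", name)
    # Pass 2 (assumption): falling back to the extension is only 'tentative'.
    for name, _magic, extensions in SIGNATURES:
        if any(lowered.endswith(ext) for ext in extensions):
            return ("tentative", name)
    return ("unidentified", None)


def identify_path(path):
    """Read the first bytes of a file on disk and identify it."""
    with open(path, "rb") as handle:
        return identify(path, handle.read(16))
```

A file whose header matches is reported as positive regardless of its name, which is why a tool of this kind can catch mislabeled or extensionless files that a simple directory listing would miss.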

The tool is very helpful, but I don’t think many people outside of large-scale digital preservation projects are actually using it, since it is somewhat of a power tool and since its main purpose is to support preservation of digital objects in a repository. You can download versions of it for all major platforms from SourceForge; the stats provided there seem to indicate that it has been downloaded around 8,000 times (version 4.0: 1,600 times).

Aside from its use for digital preservation, it can also be used when assessing files for potential accession. In the future, DROID (or an application like it) could be even more useful. When the UDFR proposal and resources such as the PLANETS Core Registry (PCR) come to fruition, particular file formats could be linked to lists of software that can render and/or undertake preservation actions for particular file types. The PLANETS tools, such as PLATO and the Testbed, when they are released in May, may include some of this expanded functionality.

In any case, my full ‘evaluation’ of DROID, which I used to ID my test records, is after the break.

Working with DROID

I tested DROID on a copy of Windows XP that I have running in VirtualBox. I also installed a version under Ubuntu Linux. The instructions in the readme.html file were clear, but required using the terminal in sudo mode, and the install package did not contain the droid.sh file that is mentioned in the documentation. Instead, I had to launch droid.jar from the terminal. Similarly, an app is available for the Mac, but double-clicking the .app file did not launch DROID, although I was able to launch it directly from the droid.jar file, and I subsequently completed some of my testing on the Mac.

There are two basic ways that DROID can be run: in standard mode or in ‘profile’ mode. Output in standard mode is in DROID’s XML format, and records can also be saved to a CSV file. (The application can also be called from the command line, but I did not have time to investigate this; it is described fully in the API documentation. It does not look as if custom outputs can be specified, but results can be written to an XML or CSV file, as from the graphical interface.)

Once I had it running, I ran DROID against two sets of records: the raw files I received from the ALA Office of Intellectual Freedom (OIF) (23 GB in 31,971 files) and the attachments for the Paul Lauterbur email file (1 GB in 6,900 files). In both cases, DROID moved through the files in a few minutes running in standard mode. The XML output, which would be suitable for writing into an Archival Information Package, is shown in figure one.

Figure One: DROID's XML output

Figure Two shows the CSV output loaded into Excel.

Figure Two: CSV Output from DROID Standard Mode

Running in profile mode, DROID took 1 hour 37 minutes and 19 minutes, respectively, to index the OIF and Lauterbur files. When DROID was finished running the profile, it presented two summarized, tabular views of the data in a pop-up window: a count and total byte size of files by year last modified, and a count and total byte size of files by PUID (see the figure below). (Unfortunately, the only way to capture this data is to manually cut and paste it into a tab-delimited file, then open it in Excel. Once the pop-up screen is cleared, it is impossible to get the information back in this format without rerunning the profile, which adds duplicate entries to the database.) The information it presents is very useful, since it essentially presents a collection profile by file format or genre, and additional work to process the files and gather statistical information can be completed in the reports section.

Figure Two: DROID Profile output shown in Excel
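The two summaries in the profile pop-up are simple tallies, and something similar could be rebuilt outside DROID if per-file records were available. The Python sketch below is not DROID code; the record fields (size, mtime, puid), the function names, and the tab-delimited layout are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def summarize(records, key):
    """Tally file count and total bytes for each value returned by key(record)."""
    totals = defaultdict(lambda: [0, 0])  # value -> [file count, total bytes]
    for record in records:
        bucket = totals[key(record)]
        bucket[0] += 1
        bucket[1] += record["size"]
    return {value: tuple(pair) for value, pair in sorted(totals.items())}


def by_year(record):
    """Year last modified, taken from a POSIX timestamp (interpreted as UTC)."""
    return datetime.fromtimestamp(record["mtime"], tz=timezone.utc).year


def by_puid(record):
    """Format identifier; assumed to be a string for every record."""
    return record["puid"]


def to_tsv(summary):
    """Render a summary as tab-delimited lines: value, file count, total bytes."""
    return "\n".join(
        f"{value}\t{count}\t{size}" for value, (count, size) in summary.items()
    )
```

Calling summarize(records, by_year) gives the count and byte total per year, summarize(records, by_puid) gives the same per format, and to_tsv turns either result into text that opens directly in Excel, without the cut-and-paste step.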

In any case, the results of the profile were saved to a Derby database, and DROID includes several reports (in JasperReports format) that can be run for different profiles and, optionally, with one or more filters applied (such as date of modification, identification status, or PUID). However, the results are unintelligible when many formats are encountered (see figure three). Additional reports can be defined (using a program such as iReport); once they are pasted into the Reports folder, they will be available from the drop-down menu on the reports page.

Figure Three: DROID Profile Report

By comparing the information produced in profile mode to the information from the standard CSV output, it should be possible to determine which particular files will present preservation challenges or to identify particular formats of material that may be outside of the proposed collection scope. In this respect, the output is moderately useful from an appraisal point of view. When I examined the raw profile information for the ALA Office of Intellectual Freedom files, I noted that there were many WordPerfect for DOS files included, as well as a variety of database formats. By searching the CSV file, I was able to identify the file names and paths for particular files of interest, and then located them deep within the directory structure, where I could examine them to see if the content they contained merited retention. Similarly, I noticed that a wide variety of audio and video file formats were included, and I was able to pull copies out of the working folder to test whether RODA or Archivematica would convert them correctly to a preservation format.

From an appraisal/assessment point of view, the CSV output could be more useful if the results for each file were saved to one row in the output. For example, it would be helpful to be presented with a clean list of all file paths for files of a particular PUID or PUIDs. Unfortunately, the results for each file are split across two or more lines in the CSV output: the first row provides the identification status, file location/name and a warning, and the subsequent rows present one or more matches from the signature file. As a result, it is impossible to re-sort the data without further conversion of the CSV file (using some kind of regular expression replacement or a modification to the DROID CSV output). It would also be possible to write an XSLT/XQuery script to output the required information from the XML file into a CSV. I did not have time to implement any of these options at this time.
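As a rough idea of what the conversion could look like, here is a Python sketch that collapses each file's rows into a single record and then lists the paths for a given PUID. The column layout is hypothetical: I assume a file row is [status, path, warning] and each match row that follows has an empty first cell and the PUID in the second. The real DROID CSV should be checked before relying on this, and the function names are my own.

```python
import csv
import io


def flatten(rows):
    """Collapse a file row plus its match rows into one record per file.

    Assumed (hypothetical) layout: a file row is [status, path, warning];
    each following match row is ["", puid, mime, format, version].
    """
    records = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        if row[0].strip():  # non-empty first cell: start of a new file
            status, path, warning = (list(row) + ["", "", ""])[:3]
            records.append(
                {"status": status, "path": path, "warning": warning, "puids": []}
            )
        elif records and len(row) > 1:  # match row for the current file
            records[-1]["puids"].append(row[1])
    return records


def paths_for(records, puid):
    """List the paths of every file with at least one match for the given PUID."""
    return [record["path"] for record in records if puid in record["puids"]]
```

The rows can come from csv.reader over an open file (opened with newline=""), and once each file is a single record the list can be sorted or filtered on any field.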

My evaluation of DROID’s utility in an appraisal process at ‘under-resourced’ archives:

  • Installation/supported platforms: 18. Installed relatively easily and runs on Mac, Windows and Linux.
  • Functionality/Reliability: 17. After the reports were generated, the Windows version loaded very slowly and/or hung on install a few times. The tool does what it says it will very well, and will be even more useful if it includes a ‘recommender’ service for software (e.g. if PRONOM or UDFR support this).
  • Usability: 6. The user interface is clean and well documented, but the CSV output is difficult to use.
  • Scalability: 10. Ran very well with very large bodies of records.
  • Cost/Market Share/Sustainability/Longevity: 8. Is well supported by the National Archives, but the future development path is uncertain; it seems as if many of the features are being incorporated in the PLANETS tools, but I am uncertain as to what outputs will be provided.
  • Interoperability/Metadata support: 8.
  • Flexibility/Customizability: 5. I am uncertain how to customize the XML or CSV output, and the process is not documented (so far as I can tell).
  • License/Support/Community: 8.

Final ‘Score’:  78/100
