TrID as a Processing Tool

On March 27, 2013, in Software Reviews, by Bethany Anderson

Filename extensions can tell us much about electronic records and their use; we learn not only a file’s format, but also something about the environment in which it was created and the way it stores data. Knowing these pieces of information provides us with important metadata that enables us to begin assessing genres, providing an entryway into the appraisal as well as the arrangement and description of electronic content.

Many reasons exist, however, for why files may wend their way into repositories without file extensions. Considering divergent file-naming practices over time and changes in the ways software renders extensions, as well as the multitude of events that corrupt or irrevocably alter files during the course of their lives, it is no surprise that file extensions are either missing or a vestige of a file-naming convention that has long since been superseded. Without file extensions, the file has lost a portion of what the OAIS reference model calls its “representation information” (the data the computer needs to render the file by opening it in the proper piece of software). Being able to recover such information with a file identifying utility is vital to providing access to electronic content.

As Chris has previously mentioned, unless file formats are correctly identified, it will be difficult to preserve them for posterity and research use. Likewise, he notes that many applications which are used in the management of electronic records require that records be in ‘archive ready’ form – i.e. that they have the correct file extensions and be situated in directories without other formats, etc. Applications such as JHOVE and DROID extract technical metadata and can thus be exceedingly useful tools to use in identifying file formats. As an appraisal tool, DROID’s analysis of file formats can help us better understand the components of a particular set of electronic records, though it does not append file extensions or assist in the processing of an access copy of records.

It may be helpful to use a file identifier like TrID with the capability to append file extensions in conjunction with an application like DROID during the process of arrangement to create an access copy of electronic content.

As part of a project to process the born-digital component of a hybrid collection, I noticed that file extensions were missing from a large number of files and decided to try a file identifying utility called TrID in order to process an access copy of the records. Created by Italian software developer Marco Pontello, TrID is a command-line (CLI) file identification utility that analyzes a file’s internal header information and compares it to a comprehensive list of file signatures, much as DROID does. While DROID uses PRONOM’s database of file signatures, TrID relies upon an extensible list of file definitions. TrID’s definitions currently cover 5037 file types, and that number is continuously increasing. Users can actively add to the definitions by utilizing TrIDScan, which analyzes the binary format of a set of files of a known file type in order to build a recognizable signature. This is exceedingly useful, especially if one has a large number of files in formats not recognized by TrID. As long as a file’s format is within TrID’s extensive list of file signatures, virtually any kind of file can be identified and the correct file extension appended. For those who are not comfortable with the command line, it is worth noting that TrID has a GUI called TrIDNet, as well as an online version.

Installing the CLI version of TrID proved to be fairly simple — I downloaded the zip file along with its file-type definitions from the website. I saved the file definitions in the same folder as the application, since the program looks there for the file by default. TrID supports various arguments, or command options, that can be used to modify its behavior (Figure 1).

Figure 1 - TrID command options.

For instance, it can be told to append file extensions automatically (the “-ae” flag), replace existing file extensions with the guessed extensions (the “-ce” flag), or to leave the file unmodified (default behavior). In addition to analyzing files and appending extensions, it also provides a percentage of confidence as a gauge of accuracy for its identificatory capability (Figure 2). While having this information is certainly important, it is hard to evaluate the significance of ambiguity represented by statistics where the file has not been identified with a high percent of confidence. Nevertheless, I noticed that TrID proved to be an effective processing tool for an access copy of electronic content and was able to identify the vast majority of the file types that had previously been elusive.

Figure 2- TrID identifies different file types with varying degrees of confidence.
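
To make the confidence percentages easier to act on, the following is a minimal Python sketch that builds a TrID command line and reads the percentages out of captured output. Only the “-ae” and “-ce” flags come from the description above. Everything else is an assumption made for illustration: that the executable is reachable as `trid`, that the file argument precedes the flag, that each result line looks like ` 75.0% (.DOC) Microsoft Word document (30000/1/2)`, and that 70% is a sensible cut-off. The function names and mode names are hypothetical, and the line format should be checked against the output of the installed version.

```python
import re

# ASSUMED shape of one TrID result line, for example:
#   " 75.0% (.DOC) Microsoft Word document (30000/1/2)"
# Verify this against the output of your own TrID version.
RESULT_LINE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)%\s+\((\.[^)]*)\)\s+(.*?)\s*(?:\(\d+(?:/\d+)*\))?\s*$"
)

# Hypothetical mode names mapped to the flags described in the review.
MODES = {"report": [], "append": ["-ae"], "replace": ["-ce"]}


def build_trid_command(target, mode="report", executable="trid"):
    """Return an argument list for TrID (executable name and order are assumptions)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    return [executable, target, *MODES[mode]]


def parse_trid_results(output):
    """Extract (confidence, extension, description) tuples from TrID output text."""
    results = []
    for line in output.splitlines():
        match = RESULT_LINE.match(line)
        if match:
            confidence, extension, description = match.groups()
            results.append((float(confidence), extension.lower(), description))
    return results


def confident_guess(results, threshold=70.0):
    """Return the highest-confidence result, or None if it falls below the threshold."""
    if not results:
        return None
    best = max(results, key=lambda item: item[0])
    return best if best[0] >= threshold else None
```

A cautious workflow might run TrID in its default, non-modifying mode first, pass the captured output through `parse_trid_results`, and only rerun with the append mode for files where `confident_guess` returns a result, setting the rest aside for manual review.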

Despite TrID’s effectiveness, it is worth noting that it does have some shortcomings. Although it is very good at identifying ASCII files, it is not especially adept at distinguishing between different flavors of ASCII. Because ASCII files are somewhat of a “digital free for all,” this limitation is understandable. TrID provides this warning when analyzing ASCII files: “TrID is best suited to analyze binary files!” (Figure 3). The vast majority of the time this will not be a problem, as the mere identification of the digital object as an ASCII file is sufficient for most processing work that is aimed at bit-level preservation. An instance may arise, however, where this could be problematic. For example, when I ran TrID against an ASCII PKCS7 digital certificate file (file extension .p7b), it correctly identified the file as an ASCII digital certificate, but reported it only as a CER file. That is not incorrect, but it is also not as precise as it could be. TrID evidently loses some granularity when analyzing ASCII files.

Figure 3 - TrID output after analyzing an ASCII file.

It is also important to note that while TrID supports wildcard characters and can evaluate many files with a single command, it is unable to inspect the files in a directory and its subdirectories at the same time. The practical import of this is that each directory and each of its subdirectories needs to be evaluated separately. Furthermore, TrID currently lacks the capability to generate reports, which would be important to include in an Archival Information Package. These two criticisms, however, must be tempered with the knowledge that TrID is a very lightweight utility that can be easily called from, or integrated into, other programs. A relatively simple script can be written to find all the files in a given set of directories and subdirectories, and those file paths can be passed to TrID as variables. For instance, a Windows PowerShell script can be used to recursively analyze a directory structure to append file extensions in subdirectories. Similarly, if one has a report writing or data analysis program, TrID can be easily called from that program to give it enhanced functionality.
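
The wrapper-script idea can be sketched in a few lines. The example below uses Python rather than PowerShell, purely as an illustration and not as the script used for this project. It walks a directory tree and prepares one TrID call per directory that contains files. It assumes the executable is reachable as `trid`, that TrID expands the trailing `*` wildcard itself, and that the file argument precedes the flag. The function names are hypothetical.

```python
import os
import subprocess


def directories_with_files(root):
    """Yield root and every subdirectory that directly contains at least one file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if filenames:
            yield dirpath


def trid_commands(root, flag="-ae", executable="trid"):
    """Build one TrID argument list per directory (name and order are assumptions)."""
    return [
        [executable, os.path.join(directory, "*"), flag]
        for directory in sorted(directories_with_files(root))
    ]


def run_all(root, dry_run=True):
    """Run, or with dry_run=True merely list, the TrID calls for a directory tree.

    Returns (command, captured_output) pairs; the output is None in a dry run.
    """
    log = []
    for command in trid_commands(root):
        if dry_run:
            log.append((command, None))
        else:
            finished = subprocess.run(
                command, capture_output=True, text=True, check=False
            )
            log.append((command, finished.stdout))
    return log
```

Because `run_all` defaults to a dry run, the planned commands can be reviewed before any file is renamed. The captured output it returns on a real run could also be written to a text file, which is one modest way to compensate for the missing report feature.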

TrID is a very effective and powerful utility that is straightforward to use and can be an exceedingly practical tool to incorporate into an electronic records processing workflow. It already has a vast collection of file definitions, the list of which will grow as users continue to engage with the program. As a processing tool, TrID complements such applications as FITS, JHOVE, and DROID, though it’s important to note that it should only be utilized for the creation of an access copy of records, not a preservation copy. In order to truly test its robustness, as well as its compatibility with DROID, it would be an interesting exercise to compare TrID’s results with those of DROID in the analysis of the same set of files. Such a comparison would yield a more telling picture of the accuracy of our digital tools and lead to new ways they can be implemented and integrated into an e-records processing workflow.

Evaluation Criteria

  • Installation/Configuration/Supported Platforms: For a novice like me, installation of TrID was easy, as it only required downloading the program and its definitions in a zip file. It can be run in a Windows or Linux/Unix environment as well. 20/20.
  • Functionality/Reliability: Despite the ease with which TrID can be integrated into existing or new workflows as a processing tool, its GUI option, while user-friendly, can only analyze one file at a time; it does not provide an automated way to analyze multiple files in a directory at once, as one can on the command line. Additionally, since it does not distinguish among ASCII files, it can be problematic if one is processing records that contain a large number of ASCII files. 18/20.
  • Usability: For those familiar with the command line, TrID is easy to use. As mentioned above, a GUI option is available as well. 10/10.
  • Scalability: Since the GUI option can only be run against one file at a time and the command line option can only be run against single directories, the utility has limitations dealing with large datasets. 7/10.
  • Documentation: The utility is so user-friendly that it requires little documentation; however, it would be nice if the website provided a means to evaluate the analysis of file types where it does not have 100% confidence. 7/10.
  • Interoperability/Metadata Support: Since anyone can add to the TrID file definition library, it’s very interoperable because it has a built-in learning capability. 10/10.
  • Flexibility/Customizability: While TrID is not itself customizable, it is a utility that can be integrated into other programs. 9/10.
  • License/Support/Sustainability/Community: TrID is freeware that can be used for personal, non-commercial, educational, and research use. It also has an informal online support forum. Because the TrID definitions can be added to by anyone, the value and capabilities of the utility will grow over time as it gains more users. 10/10.

Overall Ranking: 91/100


 
