Since getting back from our family holiday to Skye, I have spent a bit of time working with the Xena (XML Electronic Normalization for Archives) tool, which has been developed by the National Archives of Australia. I had used it (albeit indirectly) in the past, since it was included as the migration engine in the proof-of-concept version of Archivematica.
Simply put, Xena is a tool that identifies and migrates many common file formats using other open source migration tools (such as FLAC, ImageMagick, and Open Office). Files that cannot be converted to an open format undergo ‘binary normalization’, in which the original file is encoded losslessly as base64 ASCII text. The normalized files can be viewed and exported (or in the case of the binary normalized files, converted back to their original binary format) using the Xena Viewer, which is included in the distribution. Xena is available under the GPL 3 license.
Based on the testing I’ve done with it, I think Xena could play a role in small archives that want to automate many file migration tasks. However, any repository wishing to use it will need to accept the guiding philosophy behind the tool and basically agree that the normalization actions that the tool completes are acceptable (more on that later.)
Xena supports migration of files in the general categories of compressed/binary files, audio, databases, documents, email, and graphics into normalized, open formats. If you convert a directory, all files in it and its subfolders will be processed. Using the default options, Xena first tries to guess the file types (I was not able to determine what method or software it uses to do the guessing; it does not appear to be DROID or JHOVE), then calls the appropriate conversion routine (if it is a supported file type). If it is not a supported file type, the content is simply passed through as ASCII text via Base64 encoding and placed in a Xena XML wrapper; the file can be exported to its original format using the Xena Viewer application.
For example, Microsoft Word documents or spreadsheets are converted to the analogous Open Office format. Similarly, MP3 and other audio files are converted to FLAC format. mbox and pst files are converted to individual XML files and an index is created. The complete list of file types converted and the actions undertaken is given at the end of the user manual. Other file types are simply passed through in base64. Both the converted and the base64 content are wrapped in a Xena XML wrapper.
In order to test it out, I downloaded the installers for both Mac and Windows. Xena is very easy to install in either case.
Once you have completed the basic install and started the program, you need to do a bit of manual configuration to ensure that it is using the correct libraries and external programs to convert the files. For example, you need to point to ImageMagick or Open Office, which are called to convert images and documents, respectively:
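To make the ‘binary normalization’ idea concrete, here is a minimal Python sketch of wrapping a file's bytes as base64 text inside an XML envelope and then recovering the original. The element and attribute names (`wrapper`, `content`, `name`, `encoding`) are invented for illustration; I have not reproduced Xena's actual wrapper schema, and this is not Xena's code.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(data: bytes, name: str) -> str:
    """Encode raw bytes as base64 and place them in a small XML envelope.

    The element and attribute names are hypothetical, not Xena's schema.
    """
    root = ET.Element("wrapper", {"name": name})
    content = ET.SubElement(root, "content", {"encoding": "base64"})
    content.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unwrap_binary(xml_text: str) -> bytes:
    """Recover the original bytes from the envelope built by wrap_binary."""
    root = ET.fromstring(xml_text)
    # An empty payload parses back as None, so fall back to "".
    return base64.b64decode(root.find("content").text or "")


# Round trip: every possible byte value survives unchanged.
original = bytes(range(256))
wrapped = wrap_binary(original, "example.mdb")
restored = unwrap_binary(wrapped)
```

The round trip is lossless, but base64 text is roughly a third larger than the binary it encodes, which is worth keeping in mind for storage planning.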
Once I had that completed, I ran a few test conversions. The interface is simple to use, and allows you to select one or more files or directories to convert. Here is a screenshot of the basic interface:
While the normalization runs, the processes are output to the screen:
This output is also stored in a log file, and the normalized files are placed in a folder that you specify. Although I had clicked a button to preserve the directory structure, the files that I converted were not placed into subfolders as I expected.
Working with my OIF files, I was able to convert many of the files to the Xena format. For example, I ran a preselected folder of diverse file types (MS Word, WordPerfect, mp3, wav, mov, mbox, mdb, pdf, CSV, xls, ppt, jpg, and tif) through it, and all were either migrated or converted to a base64 ASCII format (for example, the mdb files). I also tried running through a deeply nested folder containing over 3,700 diverse files, mostly Word documents. About 2,700 conversions were completed before Xena became non-responsive. It appeared that Open Office had locked up while converting one file, so I closed Open Office; the file was skipped and the process continued. However, the application then became non-responsive again while attempting to convert a large TIFF file. I was unable to close ImageMagick so that the Xena process could skip the file and continue. Since Xena walks through the files in order by name, it looks like it would be possible to identify the file that caused Xena to choke, move it, and restart the process on the files that had not been converted. Nevertheless, this is a usability issue.
Once files have been converted, they can be accessed with the Xena Viewer application. Since access is one file at a time, it is a fairly cumbersome process:
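The recovery idea above (find the file that stalled the run, set it aside, and restart on the rest) could be sketched roughly as follows in Python. This is not a Xena feature: the `converted` set is a hypothetical input that you would have to assemble yourself, for example from the conversion log, and I am assuming that plain sorting by relative path approximates the order in which Xena walks the files.

```python
import tempfile
from pathlib import Path


def remaining_files(source_dir, converted):
    """List files under source_dir that are not in `converted`, in name order.

    `converted` holds paths relative to source_dir (POSIX style) that are
    already done; how you build it (e.g. from the log) is up to you.
    """
    root = Path(source_dir)
    all_files = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    return [name for name in all_files if name not in converted]


# Small demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.doc", "b.tif", "sub/c.doc"):
        path = Path(tmp, name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"")
    todo = remaining_files(tmp, converted={"a.doc"})

# Under the ordering assumption, the first unconverted file is the likely culprit.
suspect = todo[0] if todo else None
```

The suspect file can then be moved out of the way and the conversion restarted on the remaining entries in `todo`.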
The viewer application can also be used to view the normalized data in either raw or XML tree format:
I did run into a few problems using the viewer. It displayed office documents and most images well, although it can be slow to open large files. However, it took over 2 minutes to open a 72 MB mp3 file that had been converted to the FLAC format and placed in the Xena wrapper (by contrast, the mp3 was playable immediately and the exported FLAC file imported into Audacity in under 30 seconds). Once the file had opened, the only controls were ‘play’ and ‘stop’. Similarly, it took a long time to export the file to FLAC format. These would appear to impose some significant usability impairments, and it would be nice if Xena’s conversion engine allowed you to save the file without the wrapper. I was also unable to get Xena to open the email file (mbox converted to XML); the message was ‘could not parse xml, make sure this is valid xena file’. However, the input file (an 800 KB email file with about 480 messages and some MIME attachments) was a valid mbox file, so the conversion would appear to have failed.
My take on Xena
Overall, Xena is a good program, in particular for common office documents such as .doc and .wpd; it was able to convert most files that were thrown at it from my Office of Intellectual Freedom accession. It even handled files with inaccurate extensions, such as a large number of WordPerfect files, which were flawlessly converted to Open Office format by the program. The .ppt and .xls files that it converted displayed well (the .ppt file contained only text and images, no audio or video).
However, there are some provisos that any archives should be aware of and investigate further before using Xena. I will deal with each of these in the software evaluation categories I had previously established to represent the usability for small archives of e-records software:
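Handling files with wrong extensions implies that identification is based on content rather than file name. As noted, I could not determine how Xena actually does its guessing, so the following Python sketch shows only the general technique of checking well-known ‘magic number’ signatures at the start of a file. It is an illustration under that assumption, not Xena's method, and the signature table is deliberately tiny.

```python
# Leading byte signatures for a handful of formats mentioned in this post.
MAGIC = [
    (b"\xffWPC", "WordPerfect"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy MS Office)"),
    (b"%PDF", "PDF"),
    (b"fLaC", "FLAC"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),
    (b"MM\x00*", "TIFF"),
]


def sniff(header: bytes) -> str:
    """Guess a format from the first bytes of a file, ignoring its extension."""
    for signature, label in MAGIC:
        if header.startswith(signature):
            return label
    return "unknown"


# A WordPerfect file misnamed as .doc would still be recognized by its content.
guess = sniff(b"\xffWPC" + b"\x00" * 12)
```

In practice you would read the first dozen or so bytes of each file and pass them to `sniff`; dedicated tools such as DROID do this far more thoroughly.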
- Installation/supported platforms: 19/20. As indicated above, it is easy to install, but there are a few external dependencies and configuration steps that you need to take. The manual spells them out in excellent, illustrated detail.
- Functionality/Reliability: 15/20. Obviously, an application like this cannot be expected to convert all file types, but it works very well with the types it does support. However, Xena cannot be used for normalization of some specialized records. In particular, support for email and databases is not strong, although it is unlikely that any general-purpose tool can convert them, since they are stored in numerous proprietary, non-standard formats. There is also no current support for video in Xena. In addition, one needs to be aware of the exact process that will be used to convert files. In particular, the Xena developers decided to convert all types to open formats and codecs; for that reason all documents are converted to Open Office formats. Finally, one needs to be able to live with the fact that all of the normalized files are stored in the .xena format. While this format probably meets a very specific need within the context of the National Archives of Australia, it would be helpful if there were a bulk/batch method to export files from the .xena format to the target extensions.
- Usability: 7/10. A few usability issues were noted above.
- Scalability: 8/10. It worked through a large number of files quickly, but tends to slow down, not unexpectedly, when converting large image or audio files.
- Documentation: 10/10. There is a very good user manual, API documentation, a development wiki, and a bug/feature request tracker. One can register for the latter.
- Interoperability/Metadata support: 5/10. The normalized data that is produced includes some metadata in the Xena header, but repositories should run another tool, such as FITS, on each file in order to produce more authoritative information suitable for the preservation description information.
- Flexibility/Customizability: 9/10. NAA provides some excellent API documentation, and the tool can be extended via plugins to cover additional format types. I browsed the documentation for this, and a reasonably skilled Java developer could develop a plugin to cover a new file type. However, it is unclear how to modify the program to change the output, for example, to exclude the .xena wrapper. I am sure it could be done, but it looks difficult.
- License/Support/Sustainability/Community: 9/10. NAA is continuing to support the development of the tool, and has provided resources such as a feature/update tracker through SourceForge.
Overall ‘score’: 82/100