Web Archiving Service Evaluation

On February 1, 2011, in Software Reviews, by Megan Toups

This is the fourth installment in a series of evaluations of website harvesting software on the Practical E-records blog. The first three installments were reviews of open source software that you can download and install locally: HTTrack, the GNU Wget utility, and Heritrix. This fourth installment is a review of the Web Archiving Service (WAS) developed by the California Digital Library, a fee-based service for capturing and storing websites.

Continue reading »

Heritrix Evaluation/Review

On November 17, 2010, in Software Reviews, by Megan Toups

This is the third installment in a series of evaluations of website harvesting software on the Practical E-records blog.  The first two installments were reviews of the HTTrack open source software and the GNU Wget free utility.  This third installment is a review of Heritrix, the Internet Archive’s open source web archiving software.

Continue reading »

Duke DataAccessioner: Review

On October 26, 2010, in Research, Software Reviews, by Angela Jordan

As Chris has noted previously, it is important for archivists at ‘small’ repositories to be able to rapidly complete basic archival tasks like bulk file identification, transfer, and processing. For a long time, he had been meaning to test out the Duke DataAccessioner. Last week, he turned the task over to me as part of a project to process the Ed Kieser Papers.

The Duke DataAccessioner is a free, open source program that can be downloaded to your desktop and used to migrate data from physical media or directories into a dedicated file server/directory structure for preservation, further appraisal, arrangement, and description; it also provides a way to integrate metadata tools at the time of migration.
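
For readers who want a rough sense of the underlying idea, the migration step amounts to copying files off the media into a dedicated accession directory and recording fixity metadata at the same time. The sketch below is not how DataAccessioner itself is run (it has its own interface); it simply illustrates that concept with placeholder paths and generic command-line tools:

# copy the contents of a mounted disk into an accession directory, preserving timestamps
rsync -av /media/disk/ /archives/accessions/example-accession/
# record an MD5 checksum for every migrated file as metadata captured at migration time
(cd /archives/accessions/example-accession && find . -type f -exec md5sum {} \; > manifest-md5.txt)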

Continue reading »

GNU Wget Evaluation

On September 12, 2010, in Software Reviews, by Megan Toups

This is the second installment in a series of evaluations of website harvesting software on the Practical E-records blog. The first installment was an evaluation of the HTTrack open source software; this installment reviews the GNU Wget free utility.

GNU Wget is designed to handle a variety of retrievals (HTTP, HTTPS, and FTP), but today I am evaluating it only as software for website capture. According to the GNU Wget website, the software can be used on “UNIX-like operating systems as well as Microsoft Windows.” Once Wget has been downloaded and installed on your computer, you work with it from the command line. Regardless of whether you have previous experience using a command line, it is important to read the documentation carefully, because Wget offers a wide range of options for setting capture parameters. The manual is heavy with jargon, so plan to spend some time with it in order to understand the parameters you might use during your web capture.

I first downloaded the software to my computer. After reading the manual, I opened a command-line interface and navigated to the directory that held Wget. Once there, I used the following command to run the program, which I will describe in more detail below:

wget -rpxkE -t 20 --limit-rate=100k --wait=2 --directory-prefix=WebsiteCapture2 --level=20 http://www.xyz.org/ -o log1 &
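
For quick reference, the options do the following (the Wget manual covers each in full):

# -r                      retrieve recursively, following links within the site
# -p                      also download page requisites (images, stylesheets) needed to display each page
# -x                      force creation of a directory hierarchy mirroring the site structure
# -k                      convert links in the downloaded pages so they work for local browsing
# -E                      add an .html extension to pages that lack one
# -t 20                   retry each download up to 20 times
# --limit-rate=100k       cap the download speed at 100 KB per second
# --wait=2                pause 2 seconds between retrievals to reduce load on the server
# --directory-prefix=...  save everything under the WebsiteCapture2 directory
# --level=20              limit recursion to 20 levels of links
# -o log1                 write Wget’s messages to the file log1
# &                       run the whole job in the background (a shell feature, not a Wget option)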

Continue reading »

HTTrack Evaluation

On July 15, 2010, in Research, Software Reviews, by Megan Toups

HTTrack is a free, open source website copier that can be downloaded to your desktop and used to harvest websites. Due to the changing nature of the web, archivists are interested in having a way to take snapshots of websites so that we have a record of what these sites looked like and what information they contained. Finding straightforward and cost-effective ways of doing this is likely to be an essential part of archival work in the future.
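
As a rough illustration of what an invocation looks like (the URL, output directory, and filter below are placeholders, and HTTrack can also be driven through its graphical interface), a basic capture from the command line might run:

# mirror the site into the WebsiteCapture1 directory, staying within the www.xyz.org host
httrack "http://www.xyz.org/" -O ./WebsiteCapture1 "+www.xyz.org/*"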

Continue reading »

One of my major preoccupations is evaluating open source software (OSS) and the projects that develop OSS. For my Fulbright project, I settled on a rough-and-ready set of evaluation criteria, but some circumstances demand more rigor. Picking the wrong development framework or library, for example, could fatally wound an OSS development project. To help me (and hopefully the Libraries, Archives, and Museums community as a whole) get a better handle on OSS evaluation methods, I wrote a small grant application to the University of Illinois Library’s Research and Publication Committee.

Continue reading »


Planets Testbed Review

On May 7, 2010, in Research, Software Reviews, by Chris Prom

Last week, I reviewed the Planets PLATO Preservation Planning tool. The Testbed is another Planets web service that can be used in planning preservation services and actions. Its purpose is to allow users to locate, select, and test services that can be used to undertake preservation actions, such as identification, characterization, and migration. It is part of the Planets Interoperability Framework and should be available for download and local installation after the end of the Planets project on May 30th. In the meantime, users can register for an account on the public site.

The testbed includes areas to browse services, browse previous experiments, and conduct new experiments.

Continue reading »


Review of XENA Normalization Software

On April 22, 2010, in Research, by Chris Prom

Since getting back from our family holiday to Skye, I have spent a bit of time working with the Xena (XML Electronic Normalization for Archives) tool, which has been developed by the National Archives of Australia. I had used it (albeit indirectly) in the past, since it was included as the migration engine in the proof-of-concept version of Archivematica.

Simply put, Xena is a tool that identifies and migrates many common file formats using other open source migration tools (such as FLAC, ImageMagick, and OpenOffice). Files that cannot be converted to an open format undergo ‘binary normalization’, in which the original file is converted to a lossless base64 ASCII format. The normalized files can be viewed and exported (or, in the case of binary-normalized files, converted back to their original binary format) using the Xena Viewer, which is included in the distribution. Xena is available under the GPL 3 license.
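
To make the ‘binary normalization’ idea concrete (this shows only the general principle, not Xena’s actual wrapper format, and the file names are placeholders): base64 turns any stream of bytes into plain ASCII text and back again without loss, so even a file that cannot be migrated can be stored as text and later restored bit for bit:

# encode an arbitrary binary file as ASCII text, then restore it and verify the round trip
base64 mystery-file.bin > mystery-file.bin.b64
base64 --decode mystery-file.bin.b64 > mystery-file.restored
cmp mystery-file.bin mystery-file.restored   # no output means the two files are identical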

Based on the testing I’ve done with it, I think Xena could play a role in small archives that want to automate many file migration tasks. However, any repository wishing to use it will need to accept the guiding philosophy behind the tool and agree that the normalization actions the tool completes are acceptable (more on that later).

Continue reading »


FITS

On March 19, 2010, in Research, by Chris Prom

At least two people have asked me whether I have been using the FITS tool, developed by Harvard and available on Google Code. Since I hadn’t used it directly (although I was aware of what it did), I decided to download it and give it a try.

FITS does something that is potentially very useful, provided you are able to take some additional steps to integrate it into an e-records processing workflow.
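
For example (a sketch only, with placeholder paths; fits.sh and its -i/-o options are taken from the FITS documentation), one simple integration step is to loop the command-line script over a directory of accessioned files and keep one XML report per file:

# run FITS against every file in an accession directory and store the XML output alongside it
for f in /data/accession/*; do
  ./fits.sh -i "$f" -o "/data/fits-output/$(basename "$f").fits.xml"
done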

Continue reading »


Using SIARD for Database Migration

On March 19, 2010, in Research, by Chris Prom

Using the appraisal tools I discussed last week, I discovered that the OIF files I am working with contain about 150 database files. Most of these are in Microsoft Access format, although a few are Paradox files. Using a free Paradox file viewer, I was able to quickly determine that the latter were contact databases, and decided not to undertake any migration or preservation work on them.

Similarly, I examined the 84 Access databases included in the accession record and quickly determined that many of them held duplicate information. Based on an examination of each database, I determined that the vast majority of them contained transactional information (such as merchandise orders relating to Banned Books Week or conference registrations) and did not meet the appraisal criteria for permanent retention. I therefore deleted those.

But one database in particular had enough evidential and informational value to suggest that it should be preserved permanently: a comprehensive database tracking book challenges that have been reported by librarians to the OIF. While the file is certainly readable using current versions of Microsoft Access, and while I will certainly retain a copy in the final SIP that I am preparing, prudence suggested that a copy should also be generated in a non-proprietary format, so that the data at least, if not the look and feel, are preserved outside of a dependency on proprietary software.
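
Purely to illustrate that goal (this is not the route I describe below), the data itself can be dumped to an open format with any number of tools; the open source MDB Tools utilities, for instance, can list the tables in an Access file and export each one to CSV (the file and table names here are placeholders):

# list the tables in the Access file, then export one of them to comma-separated values
mdb-tables -1 challenges.mdb
mdb-export challenges.mdb Challenges > Challenges.csv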

As I noted in a previous post, SIARD, developed by the Swiss Federal Archives, is one tool that can be used for database normalization. It is a platform-independent Java tool. After spending a bit of time working with it, I am impressed by its capabilities, but unfortunately I ran into repeated and intractable problems using the program with some large Microsoft Access databases that had poorly defined schemas and/or badly structured data. In the end, I could not get the software to create a normalized database for the Challenged Books Database. (More on that after the jump.)

Continue reading »
