This is the third installment in a series of evaluations of website harvesting software on the Practical E-records blog. The first two installments were reviews of the HTTrack open source software and the GNU Wget free utility. This third installment is a review of Heritrix, the Internet Archive’s open source web archiving software.
This is the second installment in a series of evaluations of website harvesting software on the Practical E-records blog. The first installment was an evaluation of the HTTrack open source software and this installment reviews the GNU Wget free utility.
GNU Wget is designed for several kinds of retrieval (HTTP, HTTPS, and FTP), but today I am evaluating it only as software for website capture. According to the GNU Wget website, the software can be used on any “UNIX-like operating systems as well as Microsoft Windows.” Once Wget has been downloaded and installed on your computer, you work with it from the command line. Whether or not you have previous command-line experience, it is important to read the documentation carefully, because Wget offers many options for setting the parameters of a capture. The manual is heavy with jargon, so plan to spend some time with it in order to understand the parameters you might use during your web capture.
I first downloaded the software to my computer. After reading the manual, I opened a command-line window and changed to the directory that held Wget. Once there, I used the following command to run the program, which I will describe in more detail below:
wget -rpxkE -t 20 --limit-rate=100k --wait=2 --directory-prefix=WebsiteCapture2 --level=20 http://www.xyz.org/ -o log1 &
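As a reading aid, the sketch below spells out the bundled short flags from the command above as their long-option equivalents, one per line. It is only an illustration, not the exact invocation used in the review: it prints the command instead of running it, it leaves off the trailing & that sent the original job to the background, and the URL, directory name, and log name are the same placeholders used in the post. Note that older Wget releases call --adjust-extension by the name --html-extension.

```shell
#!/bin/sh
# Long-option spelling of: wget -rpxkE -t 20 ... -o log1
# TARGET_URL is the placeholder address from the post, not a real site.
TARGET_URL="http://www.xyz.org/"

set --                                          # start with an empty argument list
set -- "$@" --recursive                         # -r: follow links and download recursively
set -- "$@" --page-requisites                   # -p: also fetch images, CSS, etc. needed to display pages
set -- "$@" --force-directories                 # -x: recreate the site's directory hierarchy locally
set -- "$@" --convert-links                     # -k: rewrite links so the copy can be browsed offline
set -- "$@" --adjust-extension                  # -E: add .html where needed (older name: --html-extension)
set -- "$@" --tries=20                          # -t 20: retry a failed download up to 20 times
set -- "$@" --limit-rate=100k                   # cap download speed
set -- "$@" --wait=2                            # pause 2 seconds between retrievals
set -- "$@" --directory-prefix=WebsiteCapture2  # save everything under this directory
set -- "$@" --level=20                          # maximum recursion depth
set -- "$@" --output-file=log1                  # -o log1: write messages to a log file

# Print the assembled command for review rather than executing it.
cmd="wget $* $TARGET_URL"
echo "$cmd"
```

To actually run the capture, replace the last two lines with wget "$@" "$TARGET_URL" once you are satisfied with the options.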
As I noted a few weeks ago, Emily Brock and I are reviewing formal evaluation methods for Open Source Software (OSS). We’re doing this because I would like to get a handle on what worked or didn’t work with the Archon project. Having an objective understanding of that project’s strengths and weaknesses will be critical as the ArchivesSpace project moves forward. The article that Emily and I hope to write will complement Sibyl Schaefer’s excellent Code4Lib piece.
The evaluation tools and methods that we found help users select software. While they may facilitate project improvement or self-criticism, that is not their primary purpose. Therefore, Emily Brock and I will be putting together a new method for OSS project evaluation and self-criticism, then testing whether it works.
All this is just to say, by way of introduction, that over the next few days, Emily and I will be releasing some posts based on the initial literature review we completed. Before we get to that, note that Wikipedia contains a helpful overview and comparison of existing open source software assessment methodologies. It lists the Open Source Maturity Model (OSMM) from Navica, the Qualification and Selection of Open Source software (QSOS), and the Open Business Readiness Rating (OpenBRR). Look out soon for more in-depth reviews of these and other methods!
HTTrack is a free, open source website copier that can be downloaded to your desktop and used to harvest websites. Because the web changes constantly, archivists are interested in having a way to take snapshots of websites so that we have a record of what these sites looked like and what information they contained. Finding straightforward and cost-effective ways of doing this is likely to be an essential part of archival work in the future.
At the Practical E-Records Seminar last week, I tried to make the point that if an archives really wants to begin a program for accessioning, preserving, and providing access to born-digital records, the responsible staff member should get involved in an open source software (OSS) project. In my experience, there really is no substitute for taking an active part in software development, either as a product tester, contributor, or preferably an active committer. It will really help you understand how computers and programming work, and working on OSS illustrates the challenges of managing a project. It forces you to quickly separate your wants (“wouldn’t it be great if this software did ‘X’”) from your needs (“. . .but we absolutely need it to do ‘Y’”) and to realize that developing software is all about encouraging people to work together for the common good.
But which open source software project should you get involved in? I certainly can’t pick your project for you. And listing off specific projects to shame might be a bit unseemly. So, based on my experience over the past ten months (reading documentation for more projects than you care to imagine and nearly falling into depression over yet another failed installation procedure), I’d like to offer a nice table of factors you might refer to when considering which software project you might want to get involved with: