This is the second installment in a series of evaluations of website harvesting software on the Practical E-records blog. The first installment was an evaluation of the HTTrack open source software and this installment reviews the GNU Wget free utility.
GNU Wget is designed to handle a variety of retrievals over HTTP, HTTPS, and FTP, but here I am evaluating it only as software for website capture. According to the GNU Wget website, the software runs on any “UNIX-like operating systems as well as Microsoft Windows.” Once Wget has been downloaded and installed on your computer, you work with it from the command line. Whether or not you have previous command-line experience, it is important to read the documentation carefully, because Wget offers a wide variety of options for setting capture parameters. The manual is heavy with jargon, so plan to spend some time with it in order to understand the parameters you might use during a web capture.
I first downloaded the software to my computer. After reading the manual, I opened up a command line interface screen and went to the directory that held Wget. Once there, I used the following command to run the program, which I will describe in more detail below:
wget -rpxkE -t 20 --limit-rate=100k --wait=2 --directory-prefix=WebsiteCapture2 --level=20 http://www.xyz.org/ -o log1 &
wget = the program that the following options apply to
-r = download recursively
-p = get all images and other files needed to display each HTML page
-x = force creation of a directory hierarchy
-k = convert links in downloaded HTML to point to local files
-E = save HTML documents with an ‘.html’ extension
-t = set the number of retries
--limit-rate = limit the download rate to the amount specified (per second)
--wait = number of seconds to wait between retrievals
--directory-prefix = save files under the named directory
--level = maximum recursion depth
URL = substitute the URL you want to capture; http://www.xyz.org/ is a placeholder, not the actual site I captured
-o = log messages to the named file
& = run Wget in the background (a feature of the shell, not of Wget itself)
Many of these parameters were set to ensure I was not taxing my host’s server or downloading the entire internet: --level, -t, --limit-rate, and --wait. The other parameters mainly configured what I wanted the end product to be: -E, -r, -p, -x, -k, --directory-prefix, and -o. And the & was there so I could do other things on my computer while the program ran.
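Because the capture runs in the background and logs to a file, you never see its progress in the terminal. A few hypothetical ways to keep an eye on it while you work (log1 is the log file named with -o in the command above; the grep pattern is an assumption about how your version of Wget phrases its log messages):

```shell
tail -n 5 log1          # peek at the most recent log lines
grep -c 'saved' log1    # rough count of files retrieved so far
jobs                    # list this shell's background jobs
```

When the capture finishes, the job disappears from the `jobs` listing and the end of the log reports totals for the run.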
Unfortunately, this capture produced a problem. If I clicked on links in my downloaded website, they pointed to nonexistent pages. I soon realized, with the help of Chris, that this was because I had told the program to append an .html extension to saved files (the -E option) while also rewriting links to point to local files before outside ones. In other words, the program had saved pages with doubled extensions like aspx.html, so the links could not reach the appropriate page. To fix this, I reran the capture without the “html extension” (-E) option. My command then looked like this:
wget -rpxk -t 20 --limit-rate=100k --wait=2 --directory-prefix=WebsiteCapture3 --level=20 http://www.xyz.org/ -o log2 &
This fixed the issue I was having with the first capture. On an initial look through the mirrored site, it appears to have captured much of the textual information and the general look and feel of navigating through the pages. It did not adequately capture some of the more complicated content, such as externally linked images or pages with more complex layouts. However, for a first try with Wget without optimizing, it did a decent job of capturing much of the desired content. A novice to command-line interfaces and website capturing can use Wget, but it is definitely important to become familiar with the documentation before attempting a capture.
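If you do want to keep -E for a future capture, one hypothetical way to spot the doubled-extension problem without clicking through the mirror is to search the capture directory for the telltale filenames (WebsiteCapture2 is the directory from the first attempt above; the .aspx pattern matches the extensions this particular site used):

```shell
# Any hits here suggest the -E/-k mismatch described above.
find WebsiteCapture2 -name '*.aspx.html'
```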
- Installation/Configuration/Supported Platforms: Installation was fairly straightforward and quick, and Wget is available on a variety of popular platforms. It installs easily under Windows, although it took some time to find a download packaged as an MSI installer. Configuration, however, takes time, because you first need to become very familiar with the documentation (at least as a novice). 16/20
- Functionality/Reliability: Once the program was set up, it ran without crashing or freezing and completed in a timely manner. On an initial look through the captured site, it appeared to have captured a majority of the written content (though not all) and some of the images. 18/20
- Usability: Easy to use once you know which parameters you want, but getting comfortable at the command line takes time, and the program would benefit from a user-friendly interface. It is not straightforward to execute a capture until you have spent a fair amount of time with the documentation. 6/10
- Scalability: Designed to capture either specific content or entire websites, so it can scale from a medium-sized capture down to something small (such as the images from a single page only). I did not test it on large-scale captures of multiple sites, but because it runs from the command line, a skilled programmer could likely script it for large-scale work. For the same reason, it could easily be integrated into other tools, such as Archivematica. 9/10
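As a sketch of how a multi-site capture might look, Wget’s --input-file (-i) option reads its URLs from a file instead of the command line, so one run can walk a whole list of sites. The file name, directory, and log name below are my own placeholders, not anything from the capture described above:

```shell
# Hypothetical batch run: urls.txt lists one site per line.
# The same politeness options (--wait, --limit-rate) apply to
# every site in the list.
wget -rpxk -t 20 --limit-rate=100k --wait=2 \
     --directory-prefix=BatchCapture --level=20 \
     --input-file=urls.txt -o batchlog &
```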
- Documentation: There is a manual, an FAQ, a mailing list, and a wiki, which is good. The manual is extensive and describes each available command in depth, but it is very confusing at first if you have no command-line experience; spending time with it is highly recommended. I also had trouble finding a comprehensive list of available commands until I was already at the command line; it would be helpful to have that condensed list in the manual. In addition, the & operator (which is actually a shell feature that runs the program in the background) is mentioned only in the examples section of the manual, which seems like a major oversight. Beyond showing me how to run the program in the background, the examples section itself was not terribly helpful. 7/10
- Interoperability/Metadata support: As with HTTrack, there is no metadata support that I could find. The program is designed to capture the site but provides no way to add metadata about the capture. While it would be nice to have this in the program, it does not seem necessary to the program’s function (even though it is necessary to its use in archives). Most likely, you would add ‘collection level’ metadata for the harvested site in an archival descriptive system. 6/10
- Flexibility/Customizability: This is one of Wget’s strengths. Because of the command-line interface, you can adjust a large variety of parameters (all detailed in depth in the manual). This flexibility is definitely a boon if you know what you are doing, but it can be intimidating for a first-time user. It can also be integrated with other tools and fits the principles of service-oriented architecture; for example, it would be possible to write a graphical front end for it. 10/10
- License/Support/Sustainability/Community: Wget is free software under the GPL version 3. Support is available through the mailing list, although a forum might be more useful, since people could search for others who have had similar issues. Without one, it is hard to tell how popular this program is and whether it will have sustained support, although it is part of the GNU Operating System project. 9/10
Final Score: 81/100
Bottom Line: A good program for small-scale website archiving projects, but it requires more time than HTTrack to get familiar with the documentation and the command-line interface, at least for novices.