HTTrack Evaluation

On July 15, 2010, in Research, Software Reviews, by Megan Toups

HTTrack is a free, open source website copier that can be downloaded to your desktop and used to harvest websites. Due to the changing nature of the web, archivists are interested in having a way to take snapshots of websites so that we have a record of what these sites looked like and what information was contained in them. Finding straightforward and cost-effective ways of doing this is likely to be an essential part of archival work in the future.

Two versions of the software are available: WinHTTrack, which runs on Windows, and WebHTTrack, which runs on Linux/Unix/BSD platforms. I used WinHTTrack to do a basic harvest of an organization’s website.

Installing the software was very simple; it required a basic download from the HTTrack website and only minimal configuration when setting up a harvest. I’ll walk you through the basic steps of setting up a capture to show you how straightforward it is.

After the program has been installed, you open the program and begin a new project:

Choose a project name and a location for the saved file. In my case, I created a folder on the desktop to hold my website captures in order to keep them in a central location while getting familiar with the program. After naming the capture and designating a location, the next step is telling the program what website(s) you want it to capture.

In the web addresses box you just type in the address of the site you want to harvest. The only trouble I had here was that the example screenshot in the HTTrack tutorial shows a trailing slash at the end of the web address (such as e-records.chrisprom.com/), but when I included it nothing was captured. This is just a minor issue with their tutorial/documentation, and it was easily remedied by deleting the trailing slash in the web address box.

The other important thing at this step is to adjust the “Preferences and mirror options” by clicking the “Set Options” button. This is where you can adjust parameters such as how many connections to maintain to the site at once or how deep the crawler should go. This is fairly straightforward if you know some of the background on website capture, such as that provided in Adrian Brown’s excellent book. Even a basic knowledge of the process (which is all I have) is enough: knowing that your computer connects to the target website’s server and that the crawler follows links to mirror the site gives you enough information to adjust the basic preferences so that you avoid crashing the target site (by opening too many connections) or capturing a prohibitively large portion of the web (by not restricting the capture enough).
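
If it helps to picture what the crawler is doing, here is a toy sketch in Python of that basic loop: fetch a page, pull out its links, and follow only the links that stay on the same site. This is only an illustration of the idea, not HTTrack itself, and the starting URL is a placeholder.

    # Toy illustration of a crawl: fetch pages, collect links, stay on one site.
    # This is not HTTrack itself; the starting URL below is a placeholder.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect the href values of the <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    start = "http://www.example.org/"      # placeholder starting address
    seen, queue = set(), [start]
    while queue and len(seen) < 25:        # small cap keeps the toy polite
        url = queue.pop()
        if url in seen or not url.startswith("http"):
            continue
        if urlparse(url).netloc != urlparse(start).netloc:
            continue                       # skip links that leave the site
        seen.add(url)
        page = urlopen(url).read().decode("utf-8", errors="replace")
        parser = LinkParser()
        parser.feed(page)
        queue.extend(urljoin(url, link) for link in parser.links)
    print(f"Visited {len(seen)} pages")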

The main parameters I worked with were under the Flow Control and Links headings. Different projects have different needs, but I only needed to adjust the number of connections to the site being captured (under Flow Control), setting it to 1 because I didn’t want to flood their system (I believe the recommended maximum is around 4 connections, but I wanted to play it safe).

The other parameters I wanted to adjust were under the Links tab. I checked the boxes for “attempt to detect all links”, “get non-HTML files related to a link”, and “get HTML files first”. I wanted to capture not just the website itself but also any related files that might be necessary for it to display properly.

Another important tab is the Limits tab, although in my case I did not need to adjust anything there. Here you can set things such as how far down you want the mirroring to go, which matters if you are trying to grab a larger domain and do not want the crawler attempting to grab the entire web. For my purposes the Limits tab was not an issue, as I was already restricting my capture to a single web domain that I knew was not prohibitively large.

Once you click “OK” to leave the options menus and then “Next” to move on, you just choose to connect to the server, click “Finish”, and the capture begins! So, at least for a general capture of a single site, configuration takes only a minute or two. Allowing only one connection at a time to the site I was capturing meant that the harvest took a little over a day, but the program runs in the background on your computer and can be left overnight.
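
For anyone who would rather not click through the Windows interface, the same kind of capture can also be driven from HTTrack’s command-line version. The sketch below simply calls the httrack program from Python with settings similar to the ones described above (one connection, related non-HTML files, HTML first); the flag names come from my reading of the HTTrack manual and may differ by version, and the URL and output folder are placeholders.

    # Rough sketch: drive the command-line httrack with settings similar to
    # those chosen in the Windows interface. The URL and output folder are
    # placeholders; check "httrack --help" for the exact flags on your version.
    import subprocess

    subprocess.run(
        [
            "httrack", "http://www.example.org",   # site to capture (no trailing slash)
            "-O", "website-captures/example-org",  # where the mirror is stored
            "-c1",                                 # one connection at a time (Flow Control)
            "--near",                              # get non-HTML files related to a link
            "-p7",                                 # get HTML files first
        ],
        check=True,
    )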

Looking through my captured site in the Opera browser, the basic look and feel of the site has been captured, along with the vast majority of the content. Some smaller items, such as video, do not appear to have been captured, but it is too early for me to tell whether this is due to a parameter I did not adjust properly or whether the content is stored offsite somewhere inaccessible to the crawler. For a first-run capture, the mirrored site looks quite good, which I think speaks to the good design of HTTrack. For a beginner to be able to capture a site relatively easily, quickly, and well speaks highly of the program. Overall, HTTrack seems to be a very easy and useful tool, at least for small archives trying to capture a limited number of websites.

Evaluation Criteria:

  • Installation/Configuration/Supported Platforms: Installation was very quick and easy, and configuration was fairly straightforward (although someone with absolutely no knowledge of website crawling might be lost). It runs on two popular platforms, Windows and Linux, which makes it suitable for wide use. There is no specific Mac version, but it could likely be run with a Windows emulator on a Mac, or possibly from the terminal on a Mac (we did not test this). 18/20
  • Functionality/Reliability: In a number of test runs and in the final website capture there were no problems with crashing or freezing. On an initial look through the captured site, it fairly reliably captured most items. 19/20
  • Usability: It was fairly straightforward to use. Although there are options to adjust, you are not overwhelmed with them on the first screen. Some parameters even carry a note saying not to adjust them unless you are very familiar with the process. 9/10
  • Scalability: I’m not sure how well this would scale, since I only ran it for one website. It seems easy enough to use for capturing larger sites because you can tailor the extent of the capture. However, it seems to be designed for smaller captures, since everything is put into the same folder even if you are capturing multiple sites. On the plus side, there is a command-line version, so it could be integrated into other processes or applications, such as Archivematica, by a skilled programmer (see the sketch after this list). 6/10
  • Documentation: The tutorial on their website is very easy to follow and helpful in running the program for the first time. Screenshots lead you through each step, and you can explore the options further there as well. In addition, there is a FAQ section and a forum, as well as more extensive information if you are more technically savvy. The documentation strikes a nice balance: useful for beginners, with more to explore if you are experienced. The only problem was the one mentioned above: the tutorial shows a trailing slash in the web address box, when in practice the slash appears to prevent capture. 9/10
  • Interoperability/Metadata support: There is no metadata support that I could find. The program is designed to capture the site but does not provide a place to add metadata about the capture. While it would be nice to have this built in, it doesn’t seem necessary to the function of the program (even though it is necessary to its use in archives). Most likely, you would add ‘collection level’ metadata for the harvested site in an archival descriptive system. 6/10
  • Flexibility/Customizability: The parameters one sets for the website capture are customizable, so it isn’t a one size fits all capture. You can choose to include/exclude certain types of files, decide how far down into the web you want the crawler to go, and how long you want it to try to connect to a site, among other options. Regarding output flexibility: there does not seem to be a whole lot of customizability, but that doesn’t seem necessary as in the end you are usually just interested in getting a mirrored site as close to the original as possible. 9/10
  • License/Support/Sustainability/Community: There is an active forum on the website where people ask questions and troubleshoot issues and the software is open source. There is also contact information for the creators. This seems like a pretty active and robust community, although with any software there is always the danger it may go out of popular use. 9/10
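
As a follow-up to the scalability point above, here is a hypothetical sketch of how a programmer might wrap the command-line version so that several sites each land in their own folder. The site list and folder names are made up, and the flags should be checked against the HTTrack manual for your version.

    # Hypothetical wrapper: capture several sites, one folder per site.
    # The site list and paths are made up; verify flags with "httrack --help".
    import subprocess
    from pathlib import Path

    sites = {
        "example-org": "http://www.example.org",
        "example-net": "http://www.example.net",
    }

    for name, url in sites.items():
        out_dir = Path("website-captures") / name        # keep each capture separate
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["httrack", url, "-O", str(out_dir), "-c1"],  # one connection per site
            check=True,
        )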

Final Score: 85/100

Bottom Line: A very straightforward and easy-to-use program for small-scale website archiving projects. It takes minimal setup and only general knowledge, and it can produce a captured website relatively quickly. It is highly recommended for anyone interested in capturing websites, even beginners.

  • phirtle

    So how does this compare to Heritrix, the industry-standard web-archiving software used by the Internet Archive, bunches of national libraries, and Columbia and Harvard for their web archiving operations? Why would you want to use this instead?

  • Chris Prom

    Peter,

    We are going to be trying Heritrix out as well, but I believe based on my initial look that Heritrix is most suitable for operations (such as those you cited) that have some higher level of tech support, not the smaller archives that this blog focuses on. So, we started with the one I felt to be most likely to be useful to those who don’t have tech support, and will review Heritrix later.

  • Seth

    I agree w/ Chris’ initial assessment that Heritrix is better suited to the technically savvy (read: comfortable with administering a linux box) or those with access to good tech-support.

    One thing to note about HTTrack in terms of authenticity & preservation is to be sure you keep the hts-cache. This cache keeps the original downloaded bits & header data, while the browse version has been modified to be viewed locally.
    Be sure under “Index, Log, Cache” to select “Store ALL files in cache” so you don’t inadvertently lose something you wanted to keep. I would consider the cache my “preservation copy” and the browse-able HTML my “access copy.”

  • Pingback: Heritrix Evaluation « Practical E-Records

  • Ben

    The note about the hts-cache is very helpful, but I am wondering how critical the lack of metadata support is? Capturing such large volumes of files without any automated metadata support concerns me. Also, Seth, I am curious to know if you’re running checksums on the cache/preservation copies.

  • Chris Prom

    Metadata support is probably not critical, I would think we will just put a descriptive record in our catalog entry for the entire site we capture to provide basic descriptive metadata, record date of capture, etc. What I’d like to do next is run the captured resource thru something like the data accessioner to capture checksums and get minimal identifying and characterization information.

  • Ben

    That’s a great idea on using the DataAccessioner to capture some of the preservation metadata and checksums. I will do that…