Other Windows Utilities for Appraisal

On March 15, 2010, in Research, by Chris Prom

When I was working with file managers to conduct some appraisal on the records in my test set (see my postings last week), there were several instances in which it seemed likely that more specialized tools would help me identify groups of files for deletion, migration, or other processing actions.  I needed, for instance, to study the structure of a complex folder, identify duplicate files, move files based on filter criteria, and rename files.  Even the most complex generic file management program cannot complete those operations.1

Many of these tasks can be accomplished using command line functions in Linux or other operating systems.  But, realistically, most archivists will have access to a Windows computer, so I started looking around for free or low-cost programs that run on Windows.  After the break, you can read about three tools I found particularly useful: TreeSize, Duplicate Cleaner, and Renamer.  Even though their names make them sound prosaic at best, they really are worth knowing about.

[updated April 22, 2010: added detailed ratings]

TreeSize

TreeSize can help you get a quick understanding of the types and relative sizes of files within a large nested folder structure.  Essentially, it profiles an entire drive (or a subset of it).  For some uses the free version would be adequate, but it is important to note that most of the graphical and charting features shown below are not available in the free version, nor does it provide a listing of files by file type.  In addition, the free version only works on the local (“C”) drive, not on networked drives.  Therefore, if you are going to be appraising a lot of records, it may be worth spending $49.95 for the full version.  The Pro version does provide 30 days of functionality before payment is required, however, so there is plenty of time to test it out.

Here is a screenshot of the Pro version, after having run it on my OIF files:


Obviously, the application helps you get a quick handle on which folders and subfolders are taking up the most space.  The results can be viewed in different ways (using the tabs) and easily drilled into.  All files of a particular extension or generic class can be listed, exported, and shown in a Windows Explorer window, to allow further manipulation as necessary (although the viewing function is extraordinarily slow to load for a networked drive or shared folder, at least on the copy I am running in VirtualBox).  In particular, one can check files, then move, compress, or delete them as wanted.  Any of the file lists produced can also be saved to a text file.  It is not hard to imagine reworking the text file a bit, then passing the file names into a conversion program to migrate all files of a given type, wherever they lie in the file hierarchy.
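To give a sense of the pass-the-list-to-a-converter idea, here is a minimal Python sketch.  It assumes a plain-text export with one path per line, and the converter command shown in the comment is only an example, not something TreeSize itself provides:

```python
"""Sketch: feed an exported file list to an external converter.

Assumes 'wp_files.txt'-style input: one path per line, possibly
quoted. The converter command is illustrative only.
"""
import subprocess
from pathlib import Path

def paths_from_export(list_file):
    """Parse a one-path-per-line export, ignoring blank lines and quotes."""
    lines = Path(list_file).read_text().splitlines()
    return [ln.strip().strip('"') for ln in lines if ln.strip()]

def migrate(list_file, converter_cmd):
    """Run an external converter over every listed file."""
    for path in paths_from_export(list_file):
        # e.g. converter_cmd = ["soffice", "--headless", "--convert-to", "pdf"]
        subprocess.run(converter_cmd + [path], check=False)
```

The parsing and the conversion are deliberately separated, so the same list can be inspected or filtered before anything is actually run against it.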

Another useful feature of the application is the ability to save the output to an XML file; an XSL file is supplied to style the output correctly in the browser.  Here is my ranking of the application.  Overall, I give it 80/100 for its utility to a ‘small’ archives processing e-records:

TreeSize: ‘Score’ for a Small Archives

  • Installation/configuration/supported platforms: 17/20 Windows only
  • Functionality/Reliability: 20/20
  • Usability: 8/10
  • Scalability: 10/10
  • Documentation: 10/10
  • Interoperability/Metadata support: 5/10
  • Flexibility/Customizability: 5/10
  • License/Support/Sustainability/Community: 5/10 paid software

Final: 80/100

Duplicate Cleaner

I looked briefly at several tools for finding and removing duplicate files, and Duplicate Cleaner was the best of the lot.  This free software installs quickly and doesn’t have any annoying ads.  Although there are a few interface quirks (for example, you have to look under a menu to tell the application to search through subfolders), the program allows you to quickly identify duplicates based on the content of the files (the application calculates an MD5 checksum for the contents of each file and compares those values) or other parameters.  The results are presented in an understandable, actionable format, with all of the duplicate files arranged into groups.
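The checksum-and-compare approach can be sketched in a few lines of Python.  This is not Duplicate Cleaner's implementation, just the same basic technique: hash each file's contents and group paths by digest:

```python
"""Sketch of content-based duplicate detection: hash each file's
contents with MD5 and group byte-identical files together."""
import hashlib
from collections import defaultdict
from pathlib import Path

def duplicate_groups(root):
    """Return lists of paths whose file contents are byte-identical."""
    by_digest = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # For very large files you would hash in chunks instead
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group corresponds to one set of duplicates, mirroring the grouped presentation described above.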

While it is possible to go through the entire list and manually check the files you wish to remove, Duplicate Cleaner also includes a ‘Selection Assistant’ which allows you to define parameters (such as path name) for files that should be marked for deletion, while preserving one file in each group.  It would be foolish to rely totally on the ‘assistant’s’ advice.  More realistically, you can use it to create an initial marking, then review the marked files and make changes before removing them.  It would be ideal if a few more filtering options were provided, or if you could negate some of the values provided (e.g. don’t mark any files for deletion in a particular folder and its subfolders), but this seems like an idle complaint since the software is free.  Overall the application is extremely useful, and I was able to identify over 5 GB of duplicate files for potential removal.


From an archival point of view, this program has two essential features.  First, any files to be deleted can be moved instead, and the original file structure can be maintained in the moved copy.  If the archives decides to keep the moved copies (perhaps outside of the dissemination information package, or in an offline storage system), you will retain the ability to recover a file or reconstruct the original order of the files, if it ever becomes necessary.  An even more useful feature is the ability to export CSV lists of files.  By exporting a list before and after deletion, then including the lists in the metadata that accompanies the Submission or Archival Information Package, a complete record of deleted or moved files can be retained.  That is very important because, as Alexandra Eveleigh noted in a comment on this blog the other day, the real problem in working to appraise records is not so much making decisions as it is recording what has been done.
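Deriving the deletion record from the two exports is straightforward to script.  The sketch below assumes each CSV has a column named "Path" holding the full file path; that column name is a guess for illustration, not Duplicate Cleaner's documented export format:

```python
"""Sketch: derive a record of deleted files from 'before' and
'after' CSV exports. The 'Path' column name is an assumption."""
import csv

def deleted_paths(before_csv, after_csv, column="Path"):
    """Return paths present in the before-export but missing afterwards."""
    def paths(fname):
        with open(fname, newline="") as f:
            return {row[column] for row in csv.DictReader(f)}
    return sorted(paths(before_csv) - paths(after_csv))
```

The resulting list could then be dropped into the metadata accompanying the information package as the record of what was removed.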

After working with the software for a while, it became apparent to me that it would be most useful after quite a bit of initial arrangement and weeding had been completed.  Once the overall set of files has been analyzed using tools like TreeSize, and after files have been arranged into a preliminary order, it would be much easier to make deletion decisions.  Therefore, I deleted only a few files during my initial use of the software, and will return to use it more when I actually process the files.

Overall, I’d give the software 73/100 points on my evaluation scale for utility in a ‘small’ archives e-records program.

Duplicate Cleaner: ‘Score’ for a Small Archives

  • Installation/configuration/supported platforms: 17/20 Windows only.
  • Functionality/Reliability: 15/20. Can be a bit slow at times.
  • Usability: 8/10 a few usability quirks
  • Scalability: 5/10
  • Documentation: 8/10
  • Interoperability/Metadata support: 8/10
  • Flexibility/Customizability: 7/10
  • License/Support/Sustainability/Community: 5/10 freeware

Final: 73/100


Renamer

In working with the files from my Office of Intellectual Freedom Accession, I noticed that (following standard practice at the time), every file created by WordPerfect 5.1 in the 1990s used non-standard file extensions (not .wpd or something meaningful).  DROID identified some, but not all, of these files based on header analysis.  In order to most easily convert these files to a more common format, they needed to be segregated from the existing records.  They all had one thing in common: use of ALL CAPS.

One would like to think that some kind of file copy tool would be available to segregate these files from the rest.  However, neither TeraCopy nor Copy Handler allows filtered copies using regular expressions, which was what I really needed.  The copy tool included in xplorer2 does, but the copying process is very slow when using them (it was counting down from 50 hours, in real time), so I aborted that process very quickly.
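For what it's worth, the regex-filtered copy I was after is simple to sketch in Python: copy only files whose names match a pattern, preserving the folder structure under a new root.  The pattern in the example is only illustrative of the extensionless ALL CAPS names:

```python
"""Sketch: copy files matching a regular expression into a new
root, preserving the original subfolder structure."""
import re
import shutil
from pathlib import Path

def copy_matching(src_root, dest_root, pattern):
    """Copy files whose names fully match `pattern`, keeping subpaths."""
    rx = re.compile(pattern)
    copied = []
    for path in Path(src_root).rglob("*"):
        if path.is_file() and rx.fullmatch(path.name):
            target = Path(dest_root) / path.relative_to(src_root)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 preserves the modified date
            copied.append(target)
    return copied

# Example: extensionless ALL-CAPS names such as LETTER01
# copy_matching("oif", "oif_wp", r"[A-Z0-9]{1,8}")
```

Using shutil.copy2 rather than a plain copy keeps the original timestamps on the working copies, which matters for the provenance concerns discussed below.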

Therefore, once I was honest with myself, I had to admit that the original order (or even names) of the files that are part of this accession will need to be changed if I really want to process them properly using a tool other than XENA, which has an integrated ‘file type guesser’.  One would like to think that this would not be necessary, but at least in my case I have found it to be unavoidable, since I needed to make copies of all files in a separate location so they could be migrated using existing tools (rather than some kind of custom script).

By using a renaming tool that includes support for regular expressions, all files using the ALL CAPS syntax can have the .wpd extension appended.  This will facilitate additional handling, such as identification during a conversion routine, copying to another location for migration using the simple text-based filters in Copy Handler, or the deletion of all files NOT corresponding to the new file extension (using a tool like Deletor, for instance).
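The renaming step can be sketched as a preview pass followed by an apply pass that skips collisions, mirroring the preview-then-rename workflow and the refusal to overwrite described below.  The regular expression is my guess at the extensionless ALL CAPS 8.3-style names, not a tested rule:

```python
"""Sketch: append '.wpd' to extensionless ALL-CAPS 8.3-style
names, previewing first and skipping name collisions."""
import re
from pathlib import Path

# Assumed pattern: up to 8 caps/digits, no extension, e.g. MEMO1994
CAPS_83 = re.compile(r"[A-Z0-9_\-]{1,8}")

def preview_renames(root):
    """Return (old, new) path pairs without touching anything."""
    pairs = []
    for path in Path(root).rglob("*"):
        if path.is_file() and CAPS_83.fullmatch(path.name):
            pairs.append((path, path.with_name(path.name + ".wpd")))
    return pairs

def apply_renames(pairs):
    """Rename each pair, skipping any whose target already exists."""
    skipped = []
    for old, new in pairs:
        if new.exists():
            skipped.append(old)  # would overwrite an existing file
        else:
            old.rename(new)
    return skipped
```

Separating the preview from the rename makes it easy to inspect, or save, the proposed changes before committing to them.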

The best Windows renaming tool I have used is Renamer, a free tool available under a Creative Commons attribution/no derivative works license.  (Thunar provides similar functionality in Linux XFCE and Ubuntu.)  Renamer allows you to quickly and easily rename large groups of files, based on a large number of input and output parameters.  It includes support for regular expressions and comes with a short, well-written manual.  The program operates quickly, and the developer is in the process of preparing an updated version of the software.  Usefully, the program previews the results before giving you the option to actually complete the renaming operation, so that you can see the effect of your filters.  Although the validation process takes a bit of time (several minutes when I applied a regular expression to about 22,000 files), it is better than trying to validate the actual results of a failed renaming operation.  You can, of course, considerably speed things up by previewing changes on a small subset of files, containing a range of expected file names, that have been placed into a subfolder, and only running a preview on the actual working files once you are reasonably certain the expression works.

Here is a screenshot of me previewing a process to append .wpd onto the end of all files that are in a form of the ALL CAPS 8.3 syntax, prior to running a batch conversion process to migrate the files to Open Office format and PDF/A:


Once I actually had it rename the files, it completed the operation in about one minute, and 36 files did not rename, because renaming them would have caused them to overwrite an existing file.  Upon closer examination, these files were all duplicates anyway.2

As I said, I found this application really useful.  It even includes the ability to save conversion routines as presets, and to run the program from the command line, using a preset either on a list of files or on an existing folder.  The developer behind this project has really put time into creating a program that does what it says it will, without fluff, and is targeted to real needs.

There is one caveat, however, which has nothing to do with the program itself: renaming the original files will cause the ‘date last modified’ for each file to be updated.  Therefore, if you plan to keep the original copy of the file in a way that fully preserves its context and provenance, this tool should only be used on copies of files that are being prepared for some other action, such as migration.  Otherwise, you will need to find a way to capture the original modified date into some form of reference information/file that is linked to the original and migrated objects.  It is possible to save some file conversion logs, since the results of a rename operation, showing all ‘before and after’ names and paths, can be saved to a CSV file.  But date modified is not one of the file properties saved to the output.  In fact, I’m tempted to say that if Renamer included a way to write it to the export data, it would be a near perfect application for e-records processing, since then you could easily reuse the data, or at least include it in the AIP.
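One way to capture the original modified dates before renaming is a small snapshot script: write each file's path and timestamp to a CSV that can travel with the information package.  A minimal sketch, with an assumed two-column layout:

```python
"""Sketch: record each file's original modified date to a CSV
before renaming, so the timestamp survives the rename."""
import csv
import datetime
from pathlib import Path

def snapshot_mtimes(root, out_csv):
    """Write path + ISO-format modified date for every file under root."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "modified"])
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                mtime = datetime.datetime.fromtimestamp(path.stat().st_mtime)
                writer.writerow([str(path), mtime.isoformat()])
```

Run before the rename operation, the resulting CSV can be joined against Renamer's own ‘before and after’ export by path, reuniting each renamed file with its original timestamp.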

Overall ‘Score’ for usefulness to small archives: 87/100

Renamer: ‘Score’ for a Small Archives

  • Installation/configuration/supported platforms: 17/20 Windows only
  • Functionality/Reliability: 20/20
  • Usability: 10/10
  • Scalability: 10/10 zips through a huge number of files
  • Documentation: 8/10
  • Interoperability/Metadata support: 5/10
  • Flexibility/Customizability: 8/10
  • License/Support/Sustainability/Community: 9/10 freeware but you can’t modify it

Final: 87/100

1. As I’ve been working with these tools, it’s become more and more apparent to me how useful they might be at various stages in a processing workflow. However, I think it is safe to say that little, if any, research or development work in the academic community is being put into developing the type of file management toolkit that would actually assist with the processing (as opposed to the preservation) of electronic records. Tools like RODA and Archivematica (which are already a step ahead of anything else) will require a reasonably well-formed and pre-processed body of records. The PLANETS tools assume that you are conducting preservation planning or testing on a relatively homogeneous set of records. But when confronted with the messy reality of a producer’s hard drive or shared file space, it is apparent that much work needs to be done before materials are ready to be processed.

That is where, I think, tools like the file managers I mentioned last week, and these utilities, come in. That they will need to be used is confirmed by at least two studies working with real-life electronic records (Durden, Yale study). Yet a real challenge may remain in capturing the results of appraisal or processing decisions made while using these tools; even the exact type of data to capture (and at what level: collection, series, file, item?) is undefined.

2. For example, using my script, a file named TEST became TEST.WPD, conflicting with the current TEST.WPD file.
