By Thomas Padilla, graduate research assistant, University of Illinois at Urbana-Champaign

Among the many challenges that archival arrangement and description present, the subjectivity of the archivist is perhaps the most significant. While individual subjectivity cannot be escaped, a more distanced form of arrangement and description can be achieved by topic modeling collections, a technique that is becoming more common in digital humanities projects. The University Archives’ application of topic modeling to the electronic files of Carl Woese shows promise as an approach that shortens the time between acquisition and access while offering scholars an alternative ordering and interpretation of a collection to navigate.

Before the test application of topic modeling to the Woese collection began, a number of data extraction and normalization steps were taken. A disk image was made of a Woese laptop drive using Forensic Toolkit. Disk contents were visualized to gain a sense of the distribution of document file types (.doc, .pdf, .txt, etc.). Once the file types on the disk had been surveyed, the documents were gathered and converted to .txt format. Filenames were normalized and the files were passed into Topic Modeling Tool. The topic model of the collection was then refined iteratively by varying the number of topics and editing the stop list.
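The survey and filename normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the workflow the University Archives actually used: the function names `survey_extensions` and `normalize_filename` are hypothetical, and the normalization rule (lowercase, with runs of non-alphanumeric characters collapsed to underscores) is an assumption about what "normalized" might mean in practice.

```python
import re
from collections import Counter
from pathlib import Path


def survey_extensions(paths):
    """Count files per lowercase extension; files without one count as '(none)'."""
    return Counter(Path(p).suffix.lower() or "(none)" for p in paths)


def normalize_filename(name):
    """Lowercase a filename and collapse runs of non-alphanumerics to '_'.

    Assumed rule for illustration only; the extension is kept, lowercased.
    """
    path = Path(name)
    stem = re.sub(r"[^a-z0-9]+", "_", path.stem.lower()).strip("_")
    return (stem or "untitled") + path.suffix.lower()


if __name__ == "__main__":
    # Hypothetical filenames, not taken from the Woese collection.
    names = ["Draft Notes (v2).TXT", "letter.doc", "figure.PDF"]
    print(survey_extensions(names))
    print([normalize_filename(n) for n in names])
```

A count like this gives the same kind of overview as the disk visualization, and consistent filenames make it easier to trace each converted .txt file back to its source document.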

Topic Modeling Tool not only discerns topics across the collection and the relative strength of each topic in individual documents, but also shows how documents in the collection are linked by shared topic affinity. It generates HTML files that, when opened in a web browser, display all topics discovered in the collection and link each topic to browsable versions of the associated documents in rank order. The underlying data are made accessible in comma-separated values (CSV) files that provide a list of topics, a list of documents with associated topics, and a list of topics with associated documents. With basic modification of the HTML and CSS files, archives can customize the appearance of Topic Modeling Tool output to fit their respective institutions.
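Because the underlying data are plain CSV, they can be reused outside the generated HTML. The sketch below groups documents under each topic in rank order. The column names `document`, `topic`, and `weight` are assumptions made for illustration; the actual headers written by Topic Modeling Tool may differ and should be checked against the real output files. The function name `rank_documents_by_topic` is likewise hypothetical.

```python
import csv
import io
from collections import defaultdict


def rank_documents_by_topic(csv_text):
    """Map each topic to its documents, strongest topic weight first.

    Assumes columns named 'document', 'topic', and 'weight' (hypothetical).
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["topic"]].append((float(row["weight"]), row["document"]))
    return {
        topic: [doc for _, doc in sorted(pairs, reverse=True)]
        for topic, pairs in grouped.items()
    }


if __name__ == "__main__":
    # Made-up sample rows, not data from the Woese collection.
    sample = (
        "document,topic,weight\n"
        "a.txt,0,0.10\n"
        "b.txt,0,0.75\n"
        "c.txt,1,0.40\n"
    )
    print(rank_documents_by_topic(sample))
```

A mapping like this could feed a finding aid or a custom browsing page in place of the default HTML output.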

Resources to consult when investigating topic modeling as an archival processing tool include the following:
