July 2, 2010

Bill Maher sprang by my desk this morning, as excited as a puppy about something he found in newly accessioned records that were discovered in the basement of our law school. The records, generated by a University committee, document a project to survey student incomes and expenses in the mid- to late 1950s. They include published reports, correspondence, raw survey results, and coding keys for the Illiac (mainframe computer) used to crunch the data (pdf).

Interesting artifacts though they are, the keys now hold relatively little secondary value. Admittedly, the metadata on them provide some evidence concerning the work of the committee’s members. Unfortunately, the punch cards (i.e., the data) that they explain were discarded a long time ago, rendering the keys’ informational value nil from a practical point of view.

Should we care?  In the end, it is very unlikely that many scholars or students would want to reuse the study’s raw data.  But if they were interested, they would like to see more than just the data and its metadata. They would want reports, correspondence, the original surveys (which contain important qualitative information) and other records.  Users would find all of these records intensely valuable for the evidence and information that they contain.

There is a reason why those who stole Phil Jones’ email stole that particular type of record. While most scientific data is less controversial, no scientific project can be understood without adequate records documenting its origins, purpose, methods, and results, and a project opens itself up to challenge if such records are not retained or properly managed. We miss the forest for the trees if we capture only data and the metadata that makes data interpretable.

In this respect, I wonder to what extent projects like the $3.7 million, NSF-funded Data Conservancy Project or other ‘data curation’ efforts are addressing such records.

