“Machine Reading the Primeros Libros.” Digital Humanities Quarterly. November 2016.
Early modern printed books pose particular challenges for automatic transcription: uneven inking, irregular orthographies, radically multilingual texts. As a result, modern efforts to transcribe these documents tend to produce the textual gibberish commonly known as “dirty OCR” (Optical Character Recognition). This noisy output is most frequently seen as a barrier to access for scholars interested in the computational analysis or digital display of transcribed documents. This article, however, proposes that a closer analysis of dirty OCR can reveal both historical and cultural factors at play in the practice of automatic transcription. To make this argument, it focuses on tools developed for the automatic transcription of the Primeros Libros collection of sixteenth-century Mexican printed books. By bringing together the history of the collection with that of the OCR tool, it illustrates how the colonial history of these documents is embedded in, and transformed by, the statistical models used for automatic transcription. It argues that automatic transcription, itself a mechanical and practical tool, also has an interpretive effect on transcribed texts that can have practical consequences for scholarly work.