Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Laura Manrique-Gómez, Tony Montes, Arturo Rodríguez-Herrera, Rubén Manrique
TL;DR
This work tackles OCR errors in 19th-century Latin American newspaper texts by introducing LatamXIX, a large, richly sourced dataset paired with a semi-automated, LLM-based OCR correction framework. The method uses GPT-4o-mini in a diff-based post-OCR correction workflow that both fixes OCR errors and detects historical surface forms, classifying each proposed change as a surface form, an OCR error, or a hallucination. Key contributions include the LatamXIX dataset with detailed surface-form and correction inventories, a reproducible correction framework adaptable to other datasets, and insights into historical orthography and surface variation. The approach enables more accurate historical NLP on a region-specific corpus and supports studies of linguistic change and cultural-historical analysis, though hallucinations and the LLM's content-policy filtering remain notable challenges.
Abstract
This paper presents two significant contributions. First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that uses a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.
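The diff-based correction step described in the TL;DR can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `word_diffs` and `classify`, the small `KNOWN_SURFACE_FORMS` inventory, and the similarity threshold are all assumptions made for illustration. The idea is to compare the raw OCR text against an LLM-corrected version word by word, then assign each change a provisional label.

```python
from difflib import SequenceMatcher

# Hypothetical inventory of historical spellings (an assumption for this
# sketch, not the paper's inventory). "á" and "fué" were standard
# 19th-century Spanish orthography.
KNOWN_SURFACE_FORMS = {"á", "é", "ó", "fué"}


def word_diffs(ocr_text, corrected_text):
    """Return (original, corrected) pairs for every word-level change."""
    a, b = ocr_text.split(), corrected_text.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of words on both sides: pair them one to one.
            pairs.extend(zip(a[i1:i2], b[j1:j2]))
        else:
            # Insertions, deletions, or uneven replacements: keep the spans.
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs


def classify(original, corrected, threshold=0.5):
    """Toy heuristic (assumed): label a single change."""
    if original.lower() in KNOWN_SURFACE_FORMS:
        return "surface_form"
    # Small character-level edits look like OCR fixes; large rewrites
    # are flagged for manual review as possible hallucinations.
    similarity = SequenceMatcher(None, original, corrected).ratio()
    return "ocr_error" if similarity >= threshold else "possible_hallucination"


if __name__ == "__main__":
    ocr = "El gobierno fué á la c1udad"
    llm = "El gobierno fue a la ciudad"
    for original, corrected in word_diffs(ocr, llm):
        print(original, "->", corrected, ":", classify(original, corrected))
```

In a semi-automated setting, labels produced this way would only be a first pass; changes flagged as possible hallucinations, in particular, would still need human review.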
