Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Laura Manrique-Gómez; Tony Montes; Arturo Rodríguez-Herrera; Rubén Manrique

TL;DR

This work tackles the OCR-related hurdles in 19th-century Latin American newspaper texts by introducing LatamXIX, a large, richly sourced dataset augmented with an LLM-based, semi-automated OCR correction framework. The method deploys a diff-based post-OCR correction workflow using GPT-4o-mini to both fix OCR errors and surface linguistic forms, while classifying corrections as surface forms, OCR errors, or hallucinations. Key contributions include the LatamXIX dataset with detailed surface-form and correction inventories, a reproducible correction framework adaptable to other datasets, and insights into historical orthography and surface variation. The approach enables more accurate historical NLP on a region-specific corpus, facilitating linguistic change studies and cultural-historical analyses, albeit with notable challenges around hallucinations and content-policy filtering in the LLM.

Abstract

This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.
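The diff-based step mentioned in the TL;DR can be illustrated with a minimal sketch that compares an OCR text against its corrected version, extracts the word-level changes, and sorts them into rough buckets. Everything here is an assumption made for illustration: the function names, the bucket labels ("accent", "letter", "other"), the heuristics, and the sample strings are hypothetical and are not taken from the paper's implementation.

```python
import difflib
import unicodedata


def strip_accents(word: str) -> str:
    """Remove combining marks so that 'nación' and 'nacion' compare equal."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def extract_changes(ocr_text: str, corrected_text: str) -> list[tuple[str, str]]:
    """Return (old, new) pairs for every span where the two texts differ."""
    old_words = ocr_text.split()
    new_words = corrected_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of words on both sides: pair them one to one.
            changes.extend(zip(old_words[i1:i2], new_words[j1:j2]))
        else:
            # Insertions, deletions, or uneven replacements stay as one span.
            changes.append((" ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return changes


def classify_change(old: str, new: str) -> str:
    """Hypothetical heuristic buckets; not the paper's actual rules."""
    if old != new and strip_accents(old) == strip_accents(new):
        return "accent"  # differs only in diacritics
    if len(old) == len(new) and sum(a != b for a, b in zip(old, new)) == 1:
        return "letter"  # exactly one letter substituted
    return "other"  # everything else, left for manual review


# Invented sample strings, for illustration only.
for old, new in extract_changes("La nacion fué libre", "La nación fue libre"):
    print(old, "->", new, ":", classify_change(old, new))
```

A pass like this only groups candidate changes. Deciding whether a given change is a genuine surface form, an OCR error, or an LLM hallucination still requires the classification rules and review described in the paper.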

Paper Structure (20 sections, 3 figures, 3 tables)

Introduction
Related Work
Sourcing
Processing
Cleaning and filtering
Post-OCR LLM Correction
Corrections Classification
Accent changes
Specific changes
Other letter-to-letter changes
Remaining changes
Results
Future Work
Limitations
Acknowledgements
...and 5 more sections

Figures (3)

Figure 1: El Oso, Peru. An example of a scanned newspaper image. The corresponding OCR-extracted text and the corrected version can be found in Appendix \ref{sec:appendix1}.
Figure 2: Overview of the full methodology pipeline. The blue components correspond to the Layout+OCR stage to get to digitized text, and the orange components correspond to the Post-OCR LLM Correction stage. The two outputs of the pipeline are the LatamXIX Corrected Dataset and the List of Surface Forms. The Custom Layout Model also extracts the images of the newspaper which are then assigned to the related texts (context). The final version of the text has the OCR errors corrected but not the surface forms, as they are part of the language.
Figure C1: LatamXIX dataset decade distribution
