Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Laura Manrique-Gómez, Tony Montes, Arturo Rodríguez-Herrera, Rubén Manrique

TL;DR

This work tackles OCR errors in 19th-century Latin American newspaper texts by introducing LatamXIX, a large dataset paired with an LLM-based, semi-automated OCR correction framework. The method uses a diff-based post-OCR correction workflow with GPT-4o-mini to fix OCR errors and detect linguistic surface forms, classifying each proposed change as a surface form, an OCR error, or a hallucination. Key contributions include the LatamXIX dataset with detailed surface-form and correction inventories, a reproducible correction framework adaptable to other datasets, and insights into historical orthography and surface variation. The approach enables more accurate historical NLP on a region-specific corpus and supports studies of linguistic change and cultural-historical analyses, though hallucinations and the LLM's content-policy filtering remain notable challenges.

Abstract

This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.
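To make the diff-based step concrete, the following is a minimal sketch of how the changes an LLM proposes could be extracted by comparing the OCR text with its corrected version. It is not the authors' implementation: the function name `extract_corrections`, the whitespace tokenization, and the example sentence are assumptions chosen for illustration, and the later classification of each pair (surface form, OCR error, or hallucination) is left out.

```python
from difflib import SequenceMatcher


def extract_corrections(ocr_text: str, corrected_text: str) -> list[tuple[str, str]]:
    """Return (original, corrected) pairs for every span where the texts differ.

    Hypothetical helper: tokens are split on whitespace, which is an
    assumption and not necessarily the tokenization used in the paper.
    """
    ocr_tokens = ocr_text.split()
    fixed_tokens = corrected_text.split()

    # autojunk=False keeps frequent tokens from being treated as junk
    # in longer texts.
    matcher = SequenceMatcher(a=ocr_tokens, b=fixed_tokens, autojunk=False)

    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged span, nothing to review
        # For "insert" the original side is empty; for "delete" the
        # corrected side is empty.
        pairs.append((" ".join(ocr_tokens[i1:i2]), " ".join(fixed_tokens[j1:j2])))
    return pairs


# Made-up example sentence, not taken from the corpus.
ocr = "la libertad de imprcnta es vn derecho"
fixed = "la libertad de imprenta es un derecho"
corrections = extract_corrections(ocr, fixed)
```

Each resulting pair is a candidate for review: under the scheme described above, it would then be labeled as an OCR error to accept, a surface form to keep as written, or a hallucination to reject.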

Paper Structure (20 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: El Oso, Peru. An example of a scanned newspaper image. The corresponding OCR-extracted text and the corrected version can be found in the appendix, for reference.
  • Figure 2: Overview of the full methodology pipeline. The blue components correspond to the Layout+OCR stage, which produces the digitized text, and the orange components correspond to the Post-OCR LLM Correction stage. The two outputs of the pipeline are the LatamXIX Corrected Dataset and the List of Surface Forms. The Custom Layout Model also extracts the newspaper images, which are then assigned to the related texts (context). In the final version of the text the OCR errors are corrected but the surface forms are kept, as they are part of the language.
  • Figure C1: LatamXIX dataset decade distribution