OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches

Jenna Kanerva, Cassandra Ledins, Siiri Käpyaho, Filip Ginter

TL;DR

The paper tackles OCR post-correction for historical documents using open-weight LLMs, focusing on English and Finnish data to assess scalability for large archives. It systematically investigates generation hyperparameters, quantization, input segmentation, and a novel overgeneration removal technique, along with segment-boundary strategies to maintain text continuity. Results show substantial CER reductions for English (e.g., GPT-4o achieving the largest gains) but no practical improvement for Finnish with the tested open-weight models, highlighting language-specific limitations. The work demonstrates the potential and boundaries of prompt-based OCR post-correction, provides actionable guidance for deploying corrections at scale, and releases data and tools to support replication and further research.

Abstract

Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
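Since the results are reported as character error rate (CER), a small illustration of the metric may help. The sketch below assumes the standard definition (Levenshtein edit distance between the system output and the reference, divided by the reference length); this section does not state the paper's exact normalization or preprocessing, and the function names and example strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """CER under the assumed definition: edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def relative_cer_improvement(cer_before: float, cer_after: float) -> float:
    """Relative CER reduction in percent; negative if correction made things worse."""
    if cer_before <= 0:
        raise ValueError("cer_before must be positive")
    return 100.0 * (cer_before - cer_after) / cer_before


# Hypothetical example strings, not taken from the paper's data.
reference = "the quick brown fox"
ocr_output = "tbe qnick brown fox"      # two character substitutions
llm_corrected = "the qnick brown fox"   # one error fixed, one left

before = cer(ocr_output, reference)
after = cer(llm_corrected, reference)
print(f"CER before: {before:.3f}, after: {after:.3f}")
print(f"Relative improvement: {relative_cer_improvement(before, after):.1f}%")
```

A relative measure like this makes corrections comparable across documents with different starting noise levels, and it goes negative when a model introduces more errors than it fixes.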

Paper Structure

This paper contains 16 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Example extracts of texts at two different OCR noise levels from the ECCO dataset of 18th-century literature.
  • Figure 2: CER before and after correction on English test data (Llama-3.1-70B).
  • Figure 3: An example in both languages illustrating historical language artifacts alongside the corresponding GPT-4o generated output.
  • Figure 4: CER% improvement for English when using different segment lengths.