CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models

Jonathan Bourne

TL;DR

This work investigates post-OCR correction for historical newspapers using transformer language models. It introduces CLOCR-C, which uses infilling, socio-cultural prompts, and Task-In-Context Learning to reconstruct the likely original text. The method is evaluated on the NCSE and Overproof datasets using the CER and ERP metrics, and its effect on downstream NER is assessed with CoNES. The top models (e.g., GPT-4, Claude Opus) reduce CER by more than 60% on NCSE and improve entity recovery. Socio-cultural context in the prompt improves performance, whereas misleading prompts lower it. A transcribed NCSE dataset is released to support further research. The findings suggest CLOCR-C is a promising and practical approach for enhancing digital archives, while underscoring the need for affordable open-source models and for methods that predict whether highly corrupted text is recoverable.

Abstract

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction, whether that correction improves downstream NLP tasks, and how much value there is in providing socio-cultural context as part of the correction process. Experiments were conducted using seven LMs on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving a reduction of over 60% in character error rate (CER) on the NCSE dataset. The OCR improvements extend to downstream tasks such as Named Entity Recognition (NER), as measured by increased Cosine Named Entity Similarity (CoNES). Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower it. In addition to the findings, this study releases a dataset of 91 transcribed articles from the NCSE, containing a total of 40 thousand words, to support further research in this area. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and in the text requiring correction.
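
As a rough illustration of the headline metric, the sketch below computes a character error rate and the relative reduction in that rate after correction. It assumes that CER is the Levenshtein edit distance divided by the length of the reference transcription, and that the reported reduction is the relative drop in CER expressed as a percentage. The function names and the toy strings are hypothetical and are not taken from the paper's code or data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def error_reduction_percentage(cer_original: float, cer_corrected: float) -> float:
    """Relative reduction in CER, in percent (assumed definition).

    Negative values mean the correction made the text worse.
    """
    if cer_original == 0:
        raise ValueError("original CER is zero; reduction is undefined")
    return 100.0 * (cer_original - cer_corrected) / cer_original


# Toy strings, invented for illustration only.
reference = "the cat sat"
ocr_text = "tbe c4t sat"    # two character substitutions
corrected = "the cat sal"   # one remaining substitution

cer_ocr = cer(reference, ocr_text)
cer_fixed = cer(reference, corrected)
reduction = error_reduction_percentage(cer_ocr, cer_fixed)
```

Under these assumptions, a correction that halves the number of character errors yields a reduction of 50%, and a correction that introduces new errors yields a negative value.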

Paper Structure (23 sections, 3 equations, 5 figures, 18 tables)

Figures (5)

  • Figure 1: Relationship between the original and corrected CER using the Opus model. As the original CER increases, so does the average corrected value. All texts below the red line have been improved by the CLOCR-C process.
  • Figure 2: The figures show that providing socio-cultural context in the prompt dramatically increases task performance.
  • Figure 3: The overall results are not clear: system prompts help in some cases and not in others, and a given prompt works well for some models but not for others.
  • Figure 4: The difference between the instruct prompt and the full prompt varies across models and datasets, showing no clear pattern.
  • Figure 5: The relationship between CoNES and F1 improvement is substantial, with every model having a higher CoNES improvement than F1 improvement.