Scrambled text: training Language Models to correct OCR errors using synthetic data

Jonathan Bourne

TL;DR

Fine-tuning a language model on synthetic data, generated by an LM and corrupted with a character-level Markov process, can significantly improve its ability to correct OCR errors.

Abstract

OCR errors are common in digitised historical archives, significantly affecting their usability and value. Generative Language Models (LMs) have shown potential for correcting these errors using the context provided by the corrupted text and the broader socio-cultural context, a process called Context Leveraging OCR Correction (CLOCR-C). However, getting sufficient training data for fine-tuning such models can prove challenging. This paper shows that fine-tuning a language model on synthetic data, generated by an LM and corrupted with a character-level Markov process, can significantly improve the ability to correct OCR errors. Models trained on synthetic data reduce the character error rate by 55% and the word error rate by 32% over the base LM, and outperform models trained on real data. Key findings include: training on under-corrupted data is better than training on over-corrupted data; non-uniform character-level corruption is better than uniform corruption; and more tokens per observation outperforms more observations for a fixed token budget. The outputs of this paper are a set of 8 heuristics for training effective CLOCR-C models, a dataset of 11,000 synthetic 19th-century newspaper articles, and scrambledtext, a Python library for creating synthetic corrupted data.
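
As a rough illustration of what a character-level Markov corruption process can look like, the sketch below walks through a text one character at a time and picks an action (keep, substitute, delete, insert) whose probability depends on the previous action. This is not the scrambledtext API: the names `TRANSITIONS`, `SUBSTITUTIONS` and `corrupt` are hypothetical, and the probabilities are invented for the example, whereas the paper learns them per character from parallel OCR and ground truth texts.

```python
import random
import string

ALPHABET = string.ascii_lowercase

# Hypothetical transition probabilities between actions, conditioned on the
# previous action. Each row sums to 1. Values are made up for illustration.
TRANSITIONS = {
    "correct":    {"correct": 0.90, "substitute": 0.06, "delete": 0.02, "insert": 0.02},
    "substitute": {"correct": 0.70, "substitute": 0.20, "delete": 0.05, "insert": 0.05},
    "delete":     {"correct": 0.75, "substitute": 0.05, "delete": 0.15, "insert": 0.05},
    "insert":     {"correct": 0.80, "substitute": 0.05, "delete": 0.05, "insert": 0.10},
}

# Hypothetical per-character confusions (non-uniform substitution).
SUBSTITUTIONS = {"e": "co", "l": "1I", "o": "0c", "s": "5f", "m": "n"}


def corrupt(text, transitions=TRANSITIONS, substitutions=SUBSTITUTIONS, seed=None):
    """Return a corrupted copy of `text` using a simple Markov chain of actions."""
    rng = random.Random(seed)
    state = "correct"
    out = []
    for ch in text:
        # "insert" emits an extra character without consuming input, so keep
        # sampling until an action that consumes the current character is drawn.
        while True:
            row = transitions[state]
            state = rng.choices(list(row), weights=list(row.values()))[0]
            if state != "insert":
                break
            out.append(rng.choice(ALPHABET))
        if state == "correct":
            out.append(ch)
        elif state == "substitute":
            # Fall back to any other letter when no confusion is listed.
            candidates = substitutions.get(ch) or [c for c in ALPHABET if c != ch]
            out.append(rng.choice(candidates))
        # state == "delete": the character is dropped.
    return "".join(out)


if __name__ == "__main__":
    print(corrupt("the morning mail arrived late from london", seed=1))
```

Lowering the off-diagonal probabilities in `TRANSITIONS` produces lightly corrupted text, and raising them produces heavily corrupted text, which is one way to explore the under- versus over-corruption trade-off mentioned in the abstract.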

Paper Structure (20 sections, 8 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: How the elements of the project fit together. In this diagram the data is shown in blue, the processes in pink, and the computational tools such as scrambledtext and Llama in green.
  • Figure 2: The corruption network is applied at character level to the text; the conditional transition probabilities between the nodes are learnt on a per character basis from parallel OCR and ground truth texts
  • Figure 3: Solid red lines show the performance of the baseline Llama3 model, and the dashed red line shows the median CER of the NCSE dataset. Across the whole NCSE dataset, most models outperformed the baseline Llama3. Still, few managed to reduce the CER compared to the NCSE average of 0.17.
  • Figure 4: The contrast in performance between the high-corruption (CER$>$0.17) and low-corruption (CER$\leq$0.17) subsets. The red lines show the performance of the base Llama model.
  • Figure 5: The interaction between the number of tokens per observation and the total number of tokens in the whole training set. Increasing the number of tokens in the dataset improves model performance. However, the impact of the total number of tokens can be difficult to discern given the overall level of noise.
  • ...and 1 more figures
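
The figure captions above report results as character error rate (CER). As a minimal sketch, assuming the standard definition (Levenshtein edit distance divided by the length of the reference), CER and the word error rate (WER) from the abstract can be computed as below. The function names are illustrative and are not taken from the paper's code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word (whitespace tokens)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "the cat sat"
    ocr = "tne cat sat"
    print(cer(truth, ocr), wer(truth, ocr))
```

Under this definition, the 0.17 threshold in Figure 4 corresponds to roughly 17 character edits per 100 reference characters.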