RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages

Harshvivek Kashid, Pushpak Bhattacharyya

TL;DR

RoundTripOCR addresses data scarcity for post-OCR error correction in low-resource Devanagari languages by generating synthetic <Text T, OCR Output T'> pairs through multi-font rendering and OCR simulation, enabling large-scale training. It treats OCR errors as mistranslations within an automatic post-editing framework and trains transformer-based seq2seq models on six languages (Hindi, Marathi, Bodo, Nepali, Konkani, Sanskrit). Experiments show that pretrained models such as mBART (trained on all fonts) achieve substantial reductions in character error rate (CER) and word error rate (WER) compared with raw OCR output, including significant improvements on Sanskrit and unseen test samples. The study emphasizes the importance of font diversity and synthetic data augmentation and points to future work on language-adaptive OCR corrections and broader multilingual applicability.
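
Since the results are reported as CER and WER, a minimal sketch of how these metrics are commonly computed may help. It assumes the standard definition (Levenshtein edit distance divided by the reference length) and counts characters as Unicode code points rather than Devanagari grapheme clusters; the paper's exact formulation may differ, and the function names here are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate, assuming code points as the character unit."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word error rate, assuming whitespace tokenization."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```

Under these assumptions, a lower CER or WER for the corrected text than for the raw OCR output, both measured against the same reference, indicates that the correction model helped.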

Abstract

Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Just like Machine Translation systems, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose an approach for synthetic data generation for Devanagari languages, RoundTripOCR, that tackles the scarcity of the post-OCR Error Correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction by leveraging techniques from machine translation. Our method involves translating erroneous OCR output into a corrected form by treating the OCR errors as mistranslations in a parallel text corpus, employing pre-trained transformer models to learn the mapping from erroneous to correct text pairs, effectively correcting OCR errors.
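
The round-trip idea described above (render clean text with several fonts, run OCR on the result, and keep the clean text as the correction target) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `render` and `ocr` are hypothetical callables supplied by the caller, and the toy stand-ins below, including the ब/व confusion, exist only to make the example runnable.

```python
from typing import Callable, Iterable, List, Tuple


def round_trip_pairs(
    sentences: Iterable[str],
    fonts: Iterable[str],
    render: Callable[[str, str], object],
    ocr: Callable[[object], str],
) -> List[Tuple[str, str, str]]:
    """Return (font, T, T') triples: clean text T and its OCR output T'."""
    font_list = list(fonts)
    pairs = []
    for text in sentences:
        for font in font_list:
            image = render(text, font)  # Text T -> Image I
            ocr_text = ocr(image)       # Image I -> OCR output T'
            pairs.append((font, text, ocr_text))
    return pairs


# Toy stand-ins (assumptions for illustration only).
def fake_render(text, font):
    # A real renderer would return an image; here we keep the inputs.
    return (font, text)


def fake_ocr(image):
    font, text = image
    # Pretend one font triggers a systematic character confusion.
    return text.replace("ब", "व") if font == "font_b" else text


demo = round_trip_pairs(["बाबा"], ["font_a", "font_b"], fake_render, fake_ocr)
```

A correction model would then be trained with T' as the source sequence and T as the target sequence, mirroring the machine-translation framing in the abstract.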

Paper Structure

This paper contains 14 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Vowels, modifiers and consonants of Devanagari script.
  • Figure 2: RoundTripOCR: Artificial post-OCR error correction data generation process. The output is a triple <Text T, Image I, OCR output T'>, where <Text T> serves as the corrected (target) text and <OCR output T'> as the OCR output to be corrected.
  • Figure 3: Examples of images generated with different fonts during RoundTripOCR data generation process.
  • Figure 4: Comparison of different fonts and their CER in the Hindi test dataset.