Advancing Post-OCR Correction: A Comparative Study of Synthetic Data

Shuhao Guan, Derek Greene

TL;DR

This study evaluates synthetic-data strategies for post-OCR correction across eight languages, introducing a glyph-similarity-based data construction method alongside four baseline synthetic-data approaches. By comparing multiple pre-trained models (ByT5, mT5, mBART) and glyph-aware variants, it demonstrates substantial CER reductions, with ByT5 achieving the strongest gains (up to roughly $48\%$ in certain languages) and glyph-similarity data often performing best when data is scarce. The results show that data-volume augmentation provides benefits with diminishing returns, while glyph-similarity augmentation can outperform traditional noise-injection methods, especially in low-resource contexts, at the cost of $O(n^2)$ time for the similarity computation. The work highlights the practicality of synthetic data for improving historical OCR outputs and informs future directions for scalable, language-inclusive post-OCR correction.

Abstract

This paper explores the application of synthetic data in the post-OCR domain on multiple fronts by conducting experiments to assess the impact of data volume, augmentation, and synthetic data generation methods on model performance. Furthermore, we introduce a novel algorithm that leverages computer vision feature detection algorithms to calculate glyph similarity for constructing post-OCR synthetic data. Through experiments conducted across a variety of languages, including several low-resource ones, we demonstrate that models like ByT5 can significantly reduce Character Error Rates (CER) without the need for manually annotated data, and our proposed synthetic data generation method shows advantages over traditional methods, particularly in low-resource languages.
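
Since the results are reported as Character Error Rate, it may help to see how the metric is usually computed. The sketch below uses the standard definition (character-level edit distance divided by the reference length); the function names `levenshtein` and `cer` are illustrative and are not taken from the paper, whose exact evaluation script may differ in details such as text normalization.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `hyp` into `ref` (classic dynamic-programming form)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # character of ref missing in hyp
                            curr[j - 1] + 1,      # extra character in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character in a ten-character reference.
    print(cer("historical", "histor1cal"))
```

Under this definition, a correction model reduces CER when its output is closer to the ground truth, in edit-distance terms, than the raw OCR text was.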

Paper Structure

This paper contains 20 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Examples of synthetic OCR images in various languages, generated using the process in Section \ref{sec:methods_2}.
  • Figure 2: Visualizations of feature matching using ORB for fonts from different character sets, where matched feature points are connected by colored lines. For each pair of characters, two numbers are displayed: the upper number is the Jaccard Index $J$ of overlapping feature points, and the lower number is the average distance $D$.
  • Figure 3: Visualization of a glyph similarity matrix for English-language characters (52 letters only). The saturation of each cell represents the value $S_{\text{norm}}(i, j)$ between each pair of characters $i$ and $j$.
  • Figure 4: Structure of the CharBERT post-OCR model, which takes glyph embeddings as inputs. It consists of two CNN encoders and one transformer decoder.
  • Figure 5: Comparative analysis of CER changes after post-OCR correction across multiple languages. Each bar shows the distribution of CER changes, categorized as Increased (red), Decreased (green), Equal (blue), and Zero (dotted green).
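
The captions for Figures 2 and 3 describe a pairwise comparison of glyphs that ends in a normalized similarity matrix. The following is a rough sketch of that idea only, under several assumptions: each glyph is represented by a hypothetical set of feature identifiers (the paper extracts ORB features from rendered glyphs, which is not reproduced here), similarity is taken to be the Jaccard index alone (the average distance $D$ is left out), and min-max scaling stands in for the normalization behind $S_{\text{norm}}(i, j)$, which this summary does not specify. The names `jaccard` and `similarity_matrix` are illustrative.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(features: dict) -> dict:
    """Return min-max normalized pairwise similarities keyed by (char, char).

    `features` maps each character to a set of hypothetical feature IDs.
    Every unordered pair is visited once, so n characters need n*(n-1)/2
    comparisons, which is the quadratic cost mentioned in the TL;DR.
    """
    raw = {
        (c1, c2): jaccard(features[c1], features[c2])
        for c1, c2 in combinations(sorted(features), 2)
    }
    if not raw:
        return {}
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    norm = {}
    for (c1, c2), score in raw.items():
        value = (score - lo) / span if span else 0.0
        norm[(c1, c2)] = norm[(c2, c1)] = value  # store both orders (symmetric)
    return norm


if __name__ == "__main__":
    # Toy feature sets, invented for illustration only.
    toy = {"l": {1, 2, 3, 4}, "I": {1, 2, 3, 5}, "o": {7, 8, 9}}
    for pair, value in sorted(similarity_matrix(toy).items()):
        print(pair, round(value, 3))
```

In a synthetic-data pipeline of this kind, characters with high similarity to one another would be the natural candidates for substitution when injecting OCR-like noise into clean text.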