Advancing Post-OCR Correction: A Comparative Study of Synthetic Data
Shuhao Guan, Derek Greene
TL;DR
This study evaluates synthetic-data strategies for post-OCR correction across eight languages. It introduces a glyph-similarity-based data construction method and compares it with four baseline synthetic-data approaches, using several pre-trained models (ByT5, mT5, mBART) and glyph-aware variants. The experiments show substantial CER reductions, with ByT5 achieving the strongest gains (up to ~$48\%$ in certain languages) and glyph-similarity data often performing best when data is scarce. Increasing the volume of synthetic data helps, but with diminishing returns, while glyph-similarity augmentation can outperform traditional noise-injection methods, especially in low-resource contexts, at the cost of $O(n^2)$ time for the similarity computation. The work highlights the practicality of synthetic data for improving historical OCR outputs and informs future directions for scalable, language-inclusive post-OCR correction.
Abstract
This paper explores the application of synthetic data in the post-OCR domain through experiments that assess the impact of data volume, augmentation, and synthetic data generation methods on model performance. Furthermore, we introduce a novel algorithm that uses computer vision feature detection to calculate glyph similarity for constructing post-OCR synthetic data. Through experiments conducted across a variety of languages, including several low-resource ones, we demonstrate that models like ByT5 can significantly reduce Character Error Rates (CER) without the need for manually annotated data, and that our proposed synthetic data generation method shows advantages over traditional methods, particularly in low-resource languages.
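
For readers unfamiliar with the metric, the following is a minimal sketch of how Character Error Rate is commonly computed: the Levenshtein (edit) distance between the reference text and the OCR or corrected output, divided by the reference length. The function names `levenshtein`, `cer`, and `relative_cer_reduction` are illustrative and not taken from the paper, and treating a reported reduction as a relative change in CER is an assumption, since the summary above does not state how the percentage is defined.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_cer_reduction(cer_before: float, cer_after: float) -> float:
    """Relative reduction in CER (assumed definition, not from the paper)."""
    if cer_before <= 0:
        raise ValueError("cer_before must be positive")
    return (cer_before - cer_after) / cer_before


if __name__ == "__main__":
    reference = "modern"
    ocr_output = "rnodern"  # a typical glyph confusion: "m" read as "rn"
    print(cer(reference, ocr_output))
```

In this toy case the OCR output differs from the reference by two character edits over six reference characters. A correction model that restored the exact reference would bring the CER to zero, which is a relative reduction of 1.0 under the assumed definition.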
