PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy
Shuhao Guan, Moule Lin, Cheng Xu, Xinyi Liu, Jinman Zhao, Jiexin Fan, Qi Xu, Derek Greene
TL;DR
This work tackles accurate digitization of degraded historical documents by introducing PreP-OCR, a two-stage pipeline that first restores document images with a model trained on a synthetic dataset covering diverse degradations, and then applies semantic-aware post-OCR correction with ByT5. The key contributions are the synthetic data generation and a multi-directional patch fusion strategy (implemented in ResShift and related models) that improves restoration of large pages, complemented by a ByT5-based post-OCR corrector trained on OCR-error distributions. Across 13,831 pages in English, French, and Spanish, the pipeline reduces character error rates (CER) by 63.9–70.3% compared with OCR on raw images, and demonstrates cross-lingual generalization for Latin-script languages. The findings highlight the practical potential of coupling image restoration with linguistic correction to improve archival digitization, while also revealing challenges with non-Latin scripts and language-model hallucinations in post-processing.
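To make the headline metric concrete, the following is a minimal sketch of how a character error rate and its relative reduction can be computed. It assumes the common definition of CER as character-level edit distance divided by reference length; the function names and the example strings are illustrative and are not taken from the paper's evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """CER = edit distance / number of reference characters (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def relative_reduction(cer_before: float, cer_after: float) -> float:
    """Fraction of the original error removed, e.g. 0.7 means a 70% reduction."""
    if cer_before <= 0:
        raise ValueError("cer_before must be positive")
    return (cer_before - cer_after) / cer_before


# Illustrative strings only, not data from the paper.
reference = "historical archive"
raw_ocr = "hist0rical arch1ve"    # OCR output on the raw image
pipeline = "historical archlve"   # output after restoration and correction
reduction = relative_reduction(cer(raw_ocr, reference), cer(pipeline, reference))
```

Under this definition, a reported reduction of 63.9–70.3% means that the pipeline's CER is roughly 30–36% of the CER obtained on raw images.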
Abstract
This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to enhance both visual clarity and textual consistency, thereby improving text extraction from degraded historical documents. First, we synthesize document-image pairs from plaintext, rendering them with diverse fonts and layouts and then applying a randomly ordered set of degradation operations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-OCR model, fine-tuned on synthetic historical text pairs, addresses remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
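The idea of applying degradation operations in a random order can be sketched as follows. This is a toy illustration under stated assumptions: the three operations (`add_noise`, `fade_ink`, `box_blur`), their parameters, and the list-of-lists grayscale image are hypothetical stand-ins chosen to keep the example dependency-free, and they are not the degradations or the implementation used by the authors.

```python
import random
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]  # grayscale pixels in 0..255, row-major
Op = Callable[[Image, random.Random], Image]


def _clamp(value: float) -> int:
    return max(0, min(255, int(value)))


def add_noise(img: Image, rng: random.Random) -> Image:
    """Hypothetical op: additive Gaussian noise on every pixel."""
    return [[_clamp(p + rng.gauss(0, 20)) for p in row] for row in img]


def fade_ink(img: Image, rng: random.Random) -> Image:
    """Hypothetical op: blend pixels toward white to mimic faded ink."""
    alpha = rng.uniform(0.2, 0.5)
    return [[_clamp(p + alpha * (255 - p)) for p in row] for row in img]


def box_blur(img: Image, rng: random.Random) -> Image:
    """Hypothetical op: 3-tap horizontal mean blur (rng unused, kept for a uniform signature)."""
    out = []
    for row in img:
        n = len(row)
        out.append([
            _clamp((row[max(i - 1, 0)] + row[i] + row[min(i + 1, n - 1)]) / 3)
            for i in range(n)
        ])
    return out


def degrade(img: Image, ops: Sequence[Op], rng: random.Random) -> Tuple[Image, List[str]]:
    """Apply every op exactly once, in a freshly shuffled order."""
    order = list(ops)
    rng.shuffle(order)
    out = [row[:] for row in img]  # copy so the clean image stays intact
    for op in order:
        out = op(out, rng)
    return out, [op.__name__ for op in order]
```

Calling `degrade(clean_page, [add_noise, fade_ink, box_blur], random.Random(0))` returns a degraded copy together with the order that was applied, so each clean rendering can yield a (clean, degraded) training pair. Because operations such as blurring and noise do not commute, shuffling the order increases the variety of degradations seen during training.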
