PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy

Shuhao Guan, Moule Lin, Cheng Xu, Xinyi Liu, Jinman Zhao, Jiexin Fan, Qi Xu, Derek Greene

TL;DR

This work tackles accurate digitization of degraded historical documents by introducing PreP-OCR, a two-stage pipeline that first restores document images with a model trained on a rich synthetic degradation dataset and then applies semantic-aware post-OCR correction with ByT5. The key contributions are the synthetic data generation and a multi-directional patch fusion strategy (implemented in ResShift and related models) that improves restoration of large pages, complemented by ByT5-based post-processing trained on OCR-error distributions. Across 13,831 pages in English, French, and Spanish, the pipeline reduces character error rates (CER) by 63.9-70.3% compared with OCR on raw images, and demonstrates cross-lingual generalization for Latin-script languages. The findings highlight the practical potential of coupling image restoration with linguistic correction to improve archival digitization, while also revealing challenges with non-Latin scripts and language-model hallucinations in post-processing.
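To make the headline metric concrete, the following is a minimal sketch of how a character error rate and a relative CER reduction are commonly computed. It assumes the usual definition (Levenshtein edit distance divided by the reference length); the summary does not state the exact evaluation protocol, and the function names and the sample strings are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def relative_reduction(cer_before: float, cer_after: float) -> float:
    """Relative CER reduction in percent, e.g. raw OCR vs. the full pipeline."""
    return 100.0 * (cer_before - cer_after) / cer_before


# Toy strings for illustration only; they are not taken from the paper's data.
reference = "the quick brown fox"
raw_ocr = "tne qu1ck brovvn fox"
corrected = "the quick brovn fox"
print(cer(reference, raw_ocr), cer(reference, corrected))
```

A relative reduction of about 70% in this sense means the pipeline's CER is roughly 30% of the raw-OCR CER, not that 70% of pages become error-free.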

Abstract

This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to enhance both visual clarity and textual consistency, thereby improving text extraction from degraded historical documents. First, we synthesize document-image pairs from plaintext, rendering them with diverse fonts and layouts and then applying a randomly ordered set of degradation operations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-OCR model, fine-tuned on synthetic historical text pairs, addresses remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
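The "randomly ordered set of degradation operations" can be illustrated with a small sketch. The three operations below (additive noise, ink fading, box blur), their parameter ranges, and all names are assumptions chosen for illustration; the abstract does not list the actual degradations used. Only the idea of shuffling the order of operations per sample comes from the text.

```python
import random

# Grayscale image as nested lists of floats in [0, 1]; 0 = ink, 1 = paper.
Image = list[list[float]]


def clip(v: float) -> float:
    return max(0.0, min(1.0, v))


def add_noise(img: Image, rng: random.Random) -> Image:
    """Additive Gaussian noise (hypothetical sigma of 0.05)."""
    return [[clip(p + rng.gauss(0.0, 0.05)) for p in row] for row in img]


def fade_ink(img: Image, rng: random.Random) -> Image:
    """Lift dark pixels toward paper white to mimic faded ink."""
    strength = rng.uniform(0.1, 0.4)  # hypothetical range
    return [[clip(p + strength * (1.0 - p)) for p in row] for row in img]


def box_blur(img: Image, rng: random.Random) -> Image:
    """3x3 mean filter with edge replication; rng is unused but keeps one signature."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            row.append(clip(total / 9.0))
        out.append(row)
    return out


OPS = [add_noise, fade_ink, box_blur]


def degrade(clean: Image, rng: random.Random, ops=OPS):
    """Apply every operation once, in a freshly shuffled order."""
    order = list(ops)
    rng.shuffle(order)
    img = [row[:] for row in clean]  # leave the clean image untouched
    for op in order:
        img = op(img, rng)
    return img, [op.__name__ for op in order]


# Build one (clean, degraded) training pair from a toy 8x8 "page" with a vertical stroke.
rng = random.Random(0)
clean = [[1.0] * 8 for _ in range(8)]
for y in range(2, 6):
    clean[y][3] = 0.0
degraded, applied = degrade(clean, rng)
```

Because the operations do not commute (blurring before adding noise differs from blurring after it), shuffling the order yields more varied degraded images than a fixed sequence would, which is presumably the motivation for randomizing it.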

Paper Structure

This paper contains 20 sections, 7 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Example images of digitized pages from historical books, which are often affected by degraded text, aging pages, and low capture resolution.
  • Figure 2: Examples of three sets of synthetic image data. In each set, the left image is the base image and the image to its right is the corresponding degraded image.
  • Figure 3: The left panel shows a real degraded patch. The four sub-panels in the center depict restored outputs under different scanning directions, where the red circles highlight localized artifacts or noise. On the right is the final fused result, in which these artifacts are effectively suppressed.
  • Figure 4: Visualization of $\overline{\text{PSNR}}$ for selected methods. The blue boxes highlight different regions within the images. Central regions tend to exhibit higher $\overline{\text{PSNR}}$.
  • Figure 5: CER values for each book in the real dataset under different processing pipelines for 3 OCR systems. The green line indicates a decrease in CER, while the red line indicates an increase.
  • ...and 4 more figures