TextDoctor: Unified Document Image Inpainting via Patch Pyramid Diffusion Models

Wanglong Lu, Lingming Su, Jingjing Zheng, Vinícius Veloso de Melo, Farzaneh Shoeleh, John Hawkin, Terrence Tricco, Hanli Zhao, Xianta Jiang

TL;DR

TextDoctor tackles inpainting of high-resolution documents in unseen styles by learning text-element priors from patches and then applying diffusion-based restoration to the whole document. It introduces structure pyramid prediction to capture multiscale text structures and a patch pyramid diffusion model that denoises documents via pyramid patches, enabling memory-efficient, high-quality inpainting across diverse document styles. Across seven public datasets, TextDoctor performs competitively with or better than state-of-the-art methods without dataset-specific fine-tuning, and ablations validate the contributions of patch-based inference, multiscale structures, and pyramid patches. The work offers a practical, scalable approach to robust document restoration, with implications for OCR accuracy and downstream document analysis; it notes limitations with very large masks and suggests directions for faster ultra-high-resolution inference.

Abstract

Digital versions of real-world text documents often suffer from issues like environmental corrosion of the original document, low-quality scanning, or human interference. Existing document restoration and inpainting methods typically struggle with generalizing to unseen document styles and handling high-resolution images. To address these challenges, we introduce TextDoctor, a novel unified document image inpainting method. Inspired by human reading behavior, TextDoctor restores fundamental text elements from patches and then applies diffusion models to entire document images instead of training models on specific document types. To handle varying text sizes and avoid out-of-memory issues, common in high-resolution documents, we propose using structure pyramid prediction and patch pyramid diffusion models. These techniques leverage multiscale inputs and pyramid patches to enhance the quality of inpainting both globally and locally. Extensive qualitative and quantitative experiments on seven public datasets validated that TextDoctor outperforms state-of-the-art methods in restoring various types of high-resolution document images.
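The memory argument in the abstract (work on patches rather than the full high-resolution page) can be illustrated with a minimal pure-Python sketch. It is not the authors' PSPP or CPPD procedure: the function names `patch_origins`, `process_in_patches`, and `pyramid_process` are hypothetical, the per-patch function `fn` stands in for whatever model processes a patch, and both the overlap averaging and the averaging across patch sizes are assumptions made only for illustration.

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale image stored as rows of floats


def patch_origins(length: int, patch: int, stride: int) -> List[int]:
    """Start offsets so that patches of size `patch` cover [0, length)."""
    if length <= patch:
        return [0]  # a single (possibly smaller) patch covers everything
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)  # shift the last patch to reach the border
    return origins


def process_in_patches(image: Image, patch: int, stride: int,
                       fn: Callable[[Image], Image]) -> Image:
    """Apply `fn` to overlapping patches and average the overlaps.

    Only one patch is handed to `fn` at a time, so the cost of `fn`
    depends on the patch size, not on the full image resolution.
    """
    h, w = len(image), len(image[0])
    acc = [[0.0] * w for _ in range(h)]  # running sum of patch outputs
    cnt = [[0] * w for _ in range(h)]    # how many patches touched each pixel
    for top in patch_origins(h, patch, stride):
        for left in patch_origins(w, patch, stride):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            out = fn(tile)  # assumed to return a tile of the same shape
            for dy, out_row in enumerate(out):
                for dx, value in enumerate(out_row):
                    acc[top + dy][left + dx] += value
                    cnt[top + dy][left + dx] += 1
    return [[acc[y][x] / cnt[y][x] for x in range(w)] for y in range(h)]


def pyramid_process(image: Image, patch_sizes: List[int],
                    fn: Callable[[Image], Image]) -> Image:
    """Repeat patch processing at several patch sizes and average the results.

    Averaging across sizes is an illustrative assumption, not the paper's rule.
    """
    h, w = len(image), len(image[0])
    results = [process_in_patches(image, p, max(1, p // 2), fn)
               for p in patch_sizes]
    return [[sum(r[y][x] for r in results) / len(results) for x in range(w)]
            for y in range(h)]
```

With an identity `fn` (for example `lambda tile: tile`), both functions return the input unchanged, which is a quick sanity check that the patch grid covers every pixel and that overlaps are blended consistently.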

Paper Structure

This paper contains 21 sections, 17 equations, 23 figures, 23 tables, 3 algorithms.

Figures (23)

  • Figure 1: The motivation of unified high-resolution document image inpainting. Fine-grained document patches share more visual similarities between the FUNSD (left) and BCSD (right) datasets: (a) document images, (b) and (c) cropped patches. Fine-grained patches from FUNSD (red) and BCSD (blue) also intersect more in feature space under t-SNE: (d) distributions of document images, (e) and (f) distributions of image patches.
  • Figure 2: The training and testing pipelines of the SOTA method DocDiff (left) and TextDoctor (right). Once trained on image patches, TextDoctor can be applied to perform high-quality document image inpainting across various document types and resolutions.
  • Figure 3: The inference pipeline of TextDoctor. (a) First, we use PSPP to perform structure pyramid prediction from an upsampled corrupted image. (b) Then, we utilize the predicted structures to guide the patch pyramid denoising process and get the inpainted image, using CPPD. (c) Patch-based structure pyramid prediction (PSPP). (d) Conditional patch pyramid denoising (CPPD).
  • Figure 4: Visual comparisons to the SOTA document restoration and inpainting methods.
  • Figure 5: Visual comparisons to AnyText (text prompt: A text line in a document with the words "843-720-9290").
  • ...and 18 more figures