Predicting the Original Appearance of Damaged Historical Documents

Zhenhua Yang, Dezhi Peng, Yongxin Shi, Yuyi Zhang, Chongyu Liu, Lianwen Jin

TL;DR

This work defines Historical Document Repair (HDR) as predicting the original appearance of damaged historical documents, addressing the gap left by traditional low-level restoration methods. It introduces HDR28K, a large synthetic dataset with 28,552 damaged-repaired image pairs and multi-style degradations, and DiffHDR, a diffusion-based model that conditions restoration on semantic content and spatial cues while employing a character perceptual loss to preserve character fidelity. DiffHDR achieves state-of-the-art results on HDR28K across FID, LPIPS, and Rec-ACC, and demonstrations on real damaged documents indicate strong generalization. The framework also supports document editing and text block font generation, highlighting practical applications for cultural heritage preservation and digital humanities.

Abstract

Historical documents encompass a wealth of cultural treasures but suffer severe damage over time, including missing characters, paper damage, and ink erosion. However, existing document processing methods primarily focus on tasks such as binarization and enhancement, neglecting the repair of this damage. To this end, we present a new task, termed Historical Document Repair (HDR), which aims to predict the original appearance of damaged historical documents. To fill the gap in this field, we propose a large-scale dataset, HDR28K, and a diffusion-based network, DiffHDR, for historical document repair. Specifically, HDR28K contains 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradations. Moreover, DiffHDR augments the vanilla diffusion framework with semantic and spatial information and a meticulously designed character perceptual loss for contextual and visual coherence. Experimental results demonstrate that the proposed DiffHDR trained using HDR28K significantly surpasses existing approaches and exhibits remarkable performance in handling real damaged documents. Notably, DiffHDR can also be extended to document editing and text block generation, showcasing its high flexibility and generalization capacity. We believe this study could pioneer a new direction of document processing and contribute to the inheritance of invaluable cultures and civilizations. The dataset and code are available at https://github.com/yeungchenwa/HDR.

Paper Structure

This paper contains 26 sections, 6 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Definition of the Historical Document Repair (HDR) task. The green boxes represent the damaged regions and the blue boxes denote the repaired regions.
  • Figure 2: Damaged samples in HDR28K.
  • Figure 3: Construction pipeline of the HDR28K dataset.
  • Figure 4: Statistics of the HDR28K dataset.
  • Figure 5: Overview of our proposed method. DiffHDR comprises a condition parsing stage and a diffusion pipeline. In condition parsing, the user provides the content and location of the damaged characters, from which the content image $\boldsymbol{x}_{c}$ and mask image $\boldsymbol{x}_{m}$ are obtained. In the diffusion pipeline, our denoiser $\mathcal{F}$, a UNet-based network, outputs the repaired image $\boldsymbol{x}_{r}$ conditioned on the noised image $\boldsymbol{x}_{t}$, damaged image $\boldsymbol{x}_{d}$, mask image $\boldsymbol{x}_{m}$, and content image $\boldsymbol{x}_{c}$. During training, in addition to the diffusion loss $\mathcal{L}_{diff}$, we introduce a character perceptual loss $\mathcal{L}_{CP}$ to enhance the content preservation of repaired characters.
  • ...and 12 more figures
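
The conditioning described in the Figure 5 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation. The caption does not state how the conditions are fused or how the two losses are weighted, so the channel-wise concatenation, the MSE form of both loss terms, the weight `lam`, and all function names below are assumptions made for illustration. The UNet denoiser and the character feature extractor are left out; only the input assembly and the loss combination are shown.

```python
import numpy as np


def add_noise(x_r, t, alpha_bar, rng):
    """Standard DDPM forward process applied to the clean (repaired) image x_r."""
    eps = rng.standard_normal(x_r.shape)
    return np.sqrt(alpha_bar[t]) * x_r + np.sqrt(1.0 - alpha_bar[t]) * eps


def build_denoiser_input(x_t, x_d, x_m, x_c):
    """ASSUMPTION: conditions are fused by channel-wise concatenation (C, H, W)."""
    return np.concatenate([x_t, x_d, x_m, x_c], axis=0)


def total_loss(x_r_pred, x_r, feat_pred, feat_true, lam=0.1):
    """L_diff + lam * L_CP. Both terms are plain MSE here (ASSUMPTION);
    feat_* stand in for features from a hypothetical character-aware network."""
    l_diff = float(np.mean((x_r_pred - x_r) ** 2))
    l_cp = float(np.mean((feat_pred - feat_true) ** 2))
    return l_diff + lam * l_cp


rng = np.random.default_rng(0)
# Linear beta schedule, a common default and not taken from the paper.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

x_r = rng.random((3, 8, 8))  # ground-truth repaired image
x_d = rng.random((3, 8, 8))  # damaged image
x_c = rng.random((3, 8, 8))  # content image rendered from the user-provided text
x_m = np.zeros((1, 8, 8))    # binary mask marking the damaged region
x_m[:, 2:6, 2:6] = 1.0

x_t = add_noise(x_r, t=500, alpha_bar=alpha_bar, rng=rng)
denoiser_input = build_denoiser_input(x_t, x_d, x_m, x_c)  # 3 + 3 + 1 + 3 channels
```

In this sketch the denoiser would receive a 10-channel tensor and return a prediction of $\boldsymbol{x}_{r}$, which is then compared against the ground truth through `total_loss`.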