Diffusion-based Blind Text Image Super-Resolution

Yuzhe Zhang, Jiawei Zhang, Hao Li, Zhouxia Wang, Luwei Hou, Dongqing Zou, Liheng Bian

TL;DR

Extensive experiments on synthetic and real-world datasets demonstrate that the Diffusion-based Blind Text Image Super-Resolution (DiffTSR) can restore text images with more accurate text structures as well as more realistic appearances simultaneously.

Abstract

Recovering degraded low-resolution text images is challenging, especially for Chinese text images with complex strokes and severe degradation in real-world scenarios. Ensuring both text fidelity and style realness is crucial for high-quality text image super-resolution. Recently, diffusion models have achieved great success in natural image synthesis and restoration due to their powerful data distribution modeling abilities and data generation capabilities. In this work, we propose an Image Diffusion Model (IDM) to restore text images with realistic styles. Diffusion models are not only suitable for modeling realistic image distributions but also appropriate for learning text distributions. Since prior work shows that a text prior is important for guaranteeing the correctness of the restored text structure, we also propose a Text Diffusion Model (TDM) for text recognition, which can guide the IDM to generate text images with correct structures. We further propose a Mixture of Multi-modality module (MoM) that makes these two diffusion models cooperate with each other at every diffusion step. Extensive experiments on synthetic and real-world datasets demonstrate that our Diffusion-based Blind Text Image Super-Resolution (DiffTSR) can restore text images with more accurate text structures as well as more realistic appearances simultaneously.

Paper Structure

This paper contains 12 sections, 1 equation, 6 figures, 3 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Comparison of blind text image super-resolution results from different methods on synthetic and real-world text images. Our method can restore text images with high text fidelity and style realness under complex strokes, severe degradation, and various text styles.
  • Figure 2: Overview of Diffusion-based Blind Text Image Super-Resolution (DiffTSR) along with the baseline. (a) Our baseline model. It contains an Image Diffusion Model (IDM) and a text recognition model. The IDM performs diffusion-based text image super-resolution conditioned on the latent feature $\mathbf{Z}_{LR}$ from the LR image and the text prior $\mathbf{c}$, which the text recognition model extracts from the LR image. (b) DiffTSR architecture. It mainly consists of three parts: i) IDM performs the image diffusion conditioned on $\mathbf{Z}_{LR}$ and $\mathbf{C}cond_t$ to generate highly realistic images, ii) TDM conducts the text diffusion conditioned on $\mathbf{I}cond_t$, starting the reverse process from the initial text prior $\mathbf{c}_T$, to predict and correct the text prior more accurately, iii) the MoM module fuses and encodes the intermediate features of IDM and TDM at the previous step, and outputs the conditions $\mathbf{C}cond_t$ and $\mathbf{I}cond_t$ for the current time step. IDM and TDM cooperate with each other through MoM to achieve text image super-resolution with high fidelity and realness. (c) Details of MoM. It fuses $\mathbf{Z}_{LR}$, $\mathbf{Z}_{t}$, and $\mathbf{c}_t$ at step $t$, and encodes them into $\mathbf{I}cond_t$ and $\mathbf{C}cond_t$ for TDM and IDM respectively.
  • Figure 3: Motivation. To provide a text prior for text image restoration, the baseline model recognizes text from the degraded image, which is inaccurate when the degradation is severe. With an inaccurate text prior, the baseline model cannot restore the text image with high text fidelity, as shown in (a). In DiffTSR, the proposed TDM and IDM benefit from each other through MoM: over the reverse diffusion process they gradually recognize a more accurate text sequence and restore a higher-quality text image, as shown in (b). The text sequence above each super-resolution result at different time steps shows the recognized characters used for blind image super-resolution; the characters in red are the mistakenly estimated ones.
  • Figure 4: Qualitative comparison on the synthetic dataset CTR-TSR-Test with different methods, including SRCNN [dong2015image], ESRGAN [wang2018esrgan], NAFNet [chen2022simple], TSRN [wang2020scene], TBSRN [chen2021scene], TATT [ma2022text], MARCONet [li2023learning], and our method, for $\times 4$ super-resolution.
  • Figure 5: Qualitative comparison on the real-world dataset RealCE [ma2023benchmark] with different methods, including SRCNN [dong2015image], ESRGAN [wang2018esrgan], NAFNet [chen2022simple], TSRN [wang2020scene], TBSRN [chen2021scene], TATT [ma2022text], MARCONet [li2023learning], and our method, for $\times 4$ super-resolution.
  • ...and 1 more figure
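
The control flow described in the Figure 2 caption (MoM producing conditions from the previous step, then IDM and TDM each taking one reverse step) can be sketched roughly as below. This is only an illustrative skeleton and not the authors' implementation: the function names `mom`, `idm_step`, `tdm_step`, and `cooperative_reverse` are hypothetical, the update rules are simple placeholders rather than learned denoisers, and the array shapes are assumptions chosen for the demo.

```python
import numpy as np

def mom(z_lr, z_t, c_t, t):
    """Hypothetical stand-in for MoM: fuses Z_LR, Z_t and c_t at step t.

    A real MoM would be a learned network; here the "fusion" is a plain
    concatenation so that the data flow is visible.
    """
    i_cond = np.concatenate([z_lr, z_t], axis=0)  # image-side condition, consumed by TDM
    c_cond = c_t.copy()                           # text-side condition, consumed by IDM
    return i_cond, c_cond

def idm_step(z_t, z_lr, c_cond, t):
    """Placeholder for one reverse image-diffusion step (NOT a real denoiser).

    It only moves z_t toward z_lr so that the loop has a checkable result.
    """
    return z_t + (z_lr - z_t) / t

def tdm_step(c_t, i_cond, t):
    """Placeholder for one reverse text-diffusion step.

    A real TDM would revise the character predictions; this one keeps them.
    """
    return c_t

def cooperative_reverse(z_lr, c_T, num_steps, seed=0):
    """Run IDM and TDM side by side, exchanging conditions through MoM."""
    rng = np.random.default_rng(seed)
    z_t = rng.standard_normal(z_lr.shape)  # image latent starts from noise
    c_t = np.asarray(c_T)                  # text starts from the initial prior c_T
    for t in range(num_steps, 0, -1):
        # Conditions for this step come from both states at the previous step.
        i_cond, c_cond = mom(z_lr, z_t, c_t, t)
        z_t, c_t = idm_step(z_t, z_lr, c_cond, t), tdm_step(c_t, i_cond, t)
    return z_t, c_t

# Toy inputs: a 4x8 "latent" and three character indices (both assumed shapes).
z_lr = np.ones((4, 8))
c_T = np.array([3, 1, 4])
z_0, c_0 = cooperative_reverse(z_lr, c_T, num_steps=10)
```

The point of the sketch is the ordering inside the loop: both conditions are computed from the same pair of states before either model is updated, which is one plausible reading of "at the previous step" in the caption.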