PEAN: A Diffusion-Based Prior-Enhanced Attention Network for Scene Text Image Super-Resolution

Zuoyan Zhao, Hui Xue, Pengfei Fang, Shipeng Zhu

TL;DR

This work tackles the STISR problem by jointly improving visual structure and semantic readability of scene text images. It introduces PEAN, a Prior-Enhanced Attention Network that fuses an Attention-based Modulation Module (AMM) for long-range visual coherence with a diffusion-based Text Prior Enhancement Module (TPEM) to yield an Enhanced Text Prior (ETP) guiding SR toward semantic accuracy. Through multi-task learning (image restoration and text recognition), PEAN attains state-of-the-art results on the TextZoom benchmark and demonstrates strong robustness across text lengths and recognizers. The study also provides extensive ablations showing the AMM’s local-global benefits, the ETP’s semantic guidance, and the effectiveness of the pre-training and MTL strategy, highlighting PEAN’s practical impact on STISR pipelines.

Abstract

Scene text image super-resolution (STISR) aims at simultaneously increasing the resolution and readability of low-resolution scene text images, thus boosting the performance of the downstream recognition task. Two factors in scene text images, visual structure and semantic information, affect the recognition performance significantly. To mitigate the effects from these factors, this paper proposes a Prior-Enhanced Attention Network (PEAN). Specifically, an attention-based modulation module is leveraged to understand scene text images by neatly perceiving the local and global dependence of images, despite the shape of the text. Meanwhile, a diffusion-based module is developed to enhance the text prior, hence offering better guidance for the SR network to generate SR images with higher semantic accuracy. Additionally, a multi-task learning paradigm is employed to optimize the network, enabling the model to generate legible SR images. As a result, PEAN establishes new SOTA results on the TextZoom benchmark. Experiments are also conducted to analyze the importance of the enhanced text prior as a means of improving the performance of the SR network. Code is available at https://github.com/jdfxzzy/PEAN.
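
The multi-task learning paradigm mentioned above pairs an image restoration objective with a text recognition objective. The snippet below is only a minimal sketch of that idea, not the authors' implementation: the function names, the choice of an L1 pixel loss and a cross-entropy recognition loss, and the weight `lam` are all assumptions made for illustration. The actual losses are defined in the paper and the linked repository.

```python
import math

def l1_loss(sr, hr):
    """Mean absolute error between two equally sized flat pixel lists."""
    if len(sr) != len(hr):
        raise ValueError("sr and hr must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(sr, hr)) / len(sr)

def cross_entropy(char_probs, target_ids):
    """Mean negative log-likelihood of the target characters.

    char_probs[t][k] is the predicted probability of class k at position t;
    target_ids[t] is the ground-truth class index at position t.
    """
    eps = 1e-12  # guards against log(0)
    nll = sum(-math.log(probs[t] + eps) for probs, t in zip(char_probs, target_ids))
    return nll / len(target_ids)

def multi_task_loss(sr, hr, char_probs, target_ids, lam=0.5):
    """Hypothetical combined objective: restoration term + lam * recognition term."""
    return l1_loss(sr, hr) + lam * cross_entropy(char_probs, target_ids)

# Toy usage: a 2-pixel "image" and a 1-character "word" over a 2-class alphabet.
loss = multi_task_loss([0.0, 1.0], [0.5, 0.5], [[0.9, 0.1]], [0], lam=0.5)
```

With `lam=0` the objective reduces to pure image restoration; raising `lam` puts more weight on recognizability, which is the kind of trade-off a multi-task setup is meant to control.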

Paper Structure (36 sections, 11 equations, 10 figures, 24 tables)

This paper contains 36 sections, 11 equations, 10 figures, 24 tables.

Figures (10)

  • Figure 1: Comparison between previous text prior-based STISR methods (rows (b, c)) and PEAN. The incorporation of AMM enables PEAN to restore the visual structure of lengthy text in images. However, its performance is limited by the absence of semantic information (row (d)). The introduction of TP-LR partially addresses this limitation, yet its efficacy remains inadequate, leading to several failure cases (row (e)). Considering that TP-HR is a robust alternative, we conduct an exploratory experiment by replacing TP-LR with TP-HR, resulting in superior performance (row (f)). This inspires us to design a module for enhancing the TP-LR so as to obtain the ETP, which demonstrates comparable effectiveness to TP-HR in guiding the SR process (row (g)).
  • Figure 2: Overview of the architecture of our proposed Prior-Enhanced Attention Network (PEAN).
  • Figure 3: Overview of the architecture of the FAM and the strip-wise attention mechanism inside LAM and GAM.
  • Figure 4: Statistics on the performance of different text prior-based models with publicly available weights on images containing text of different lengths.
  • Figure 5: Visualization of SR images and their recognition results by ASTER. Red characters indicate wrong recognition results.
  • ...and 5 more figures