Visual Text Meets Low-level Vision: A Comprehensive Survey on Visual Text Processing

Yan Shu, Weichao Zeng, Zhenhang Li, Fangmin Zhao, Yu Zhou

TL;DR

This survey frames visual text processing as a dual-track problem, text image enhancement/restoration and text image manipulation, formalized as a mapping $f^*:\boldsymbol{X}\rightarrow\boldsymbol{Y}$. It introduces a hierarchical taxonomy spanning tasks and learning paradigms, and analyzes how text features (structure, stroke, semantics, style, and spatial context) are integrated into methods, including reconstruction-based and generative approaches such as GANs and diffusion models. The authors catalog publicly available datasets and benchmarks across super-resolution (SR), dewarping, denoising, removal, editing, and generation, and summarize performance with standard metrics, underscoring progress and remaining gaps. They highlight open challenges (data, metrics, efficiency, video extension, unified frameworks, and user interaction) and offer directions for future research aimed at real-world applications such as document digitization, privacy-preserving editing, and AR-enabled text manipulation.

Abstract

Visual text, a pivotal element in both document and scene images, speaks volumes and attracts significant attention in the computer vision domain. Beyond visual text detection and recognition, the field of visual text processing has experienced a surge in research, driven by the advent of fundamental generative models. However, challenges persist due to the unique properties and features that distinguish text from general objects. Effectively leveraging these unique textual characteristics is crucial in visual text processing, as observed in our study. In this survey, we present a comprehensive, multi-perspective analysis of recent advancements in this field. Initially, we introduce a hierarchical taxonomy encompassing areas ranging from text image enhancement and restoration to text image manipulation, followed by different learning paradigms. Subsequently, we conduct an in-depth discussion of how specific textual features such as structure, stroke, semantics, style, and spatial context are seamlessly integrated into various tasks. Furthermore, we explore available public datasets and benchmark the reviewed methods on several widely-used datasets. Finally, we identify principal challenges and potential avenues for future research. Our aim is to establish this survey as a fundamental resource, fostering continued exploration and innovation in the dynamic area of visual text processing.

Paper Structure (58 sections, 2 equations, 2 figures, 7 tables)

Figures (2)

  • Figure 1: Visualization samples of visual text processing tasks. The top row shows text image enhancement/restoration, including super-resolution [noguchi2024scene], dewarping [Ma2018DocUNetDI], and denoising [Lin2020BEDSRNetAD]. The bottom row shows text image manipulation, including text removal [peng2023viteraser], text editing [yang2023self], and text generation [zhan2019spatial].
  • Figure 2: Main structure of this survey. Initially, we introduce a hierarchical taxonomy from image enhancement and restoration to image manipulation, followed by different learning paradigms. Subsequently, we conduct an in-depth discussion of how specific textual features are integrated into various tasks. Furthermore, we explore public datasets and benchmark the reviewed methods. Finally, we identify open challenges for future research.