Hear the Scene: Audio-Enhanced Text Spotting

Jing Li, Bo Wang

TL;DR

This work tackles the high annotation cost of scene text spotting by proposing EchoSpot, a transcription-only supervisory framework that learns implicit text locations via a query-based cross-attention mechanism. It introduces a coarse-to-fine localization module, a Hungarian matching-based loss, and a circular curriculum learning strategy to enable end-to-end training from scratch, with optional audio-based annotations to further reduce labeling effort. The approach achieves competitive performance on standard benchmarks, particularly excelling with curved or irregular text, while substantially reducing the need for geometric annotations. The combination of transcription-only supervision, audio annotation, and curriculum-guided training offers a practical path toward scalable, accessible optical character recognition in real-world scenes.
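The Hungarian matching-based loss mentioned above pairs each ground-truth transcription with one predicted query before any loss is computed. The sketch below illustrates only that pairing step, under stated assumptions: the cost is a plain character edit distance (the paper's actual cost is not given here), the optimal assignment is found by brute force over permutations rather than by the Hungarian algorithm itself, and the names `edit_distance` and `match_queries` are hypothetical.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (assumed matching cost)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def match_queries(predictions, targets):
    """Assign each target transcription to a distinct prediction.

    Returns (assignment, total_cost), where assignment[t] is the index of
    the prediction matched to targets[t]. Brute force is only practical for
    a handful of instances; a real system would use a Hungarian solver.
    """
    if len(targets) > len(predictions):
        raise ValueError("need at least as many predictions as targets")
    best_cost, best_perm = None, None
    for perm in permutations(range(len(predictions)), len(targets)):
        cost = sum(edit_distance(predictions[p], targets[t])
                   for t, p in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost


assignment, cost = match_queries(["Hils", "exit", "stap"], ["stop", "Hills"])
print(assignment, cost)
```

Because the matching uses transcriptions only, no box or polygon annotation enters the cost, which is the point of transcription-only supervision.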

Abstract

Recent advancements in scene text spotting have focused on end-to-end methodologies that rely heavily on precise location annotations, which are often costly and labor-intensive to procure. In this study, we introduce an approach that uses only transcription annotations for training text spotting models, substantially reducing the dependency on elaborate annotation processes. Our methodology employs a query-based paradigm that facilitates the learning of implicit location features through the interaction between text queries and image embeddings. These features are later refined during the text recognition phase using an attention activation map. To address the challenges of training a weakly supervised model from scratch, we implement a circular curriculum learning strategy to improve model convergence. Additionally, we introduce a coarse-to-fine cross-attention localization mechanism for more accurate text instance localization. Notably, our framework supports audio-based annotation, which significantly reduces annotation time and provides an inclusive alternative for individuals with disabilities. Our approach achieves competitive performance on existing benchmarks, demonstrating that high accuracy in text spotting can be attained without extensive location annotations.
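The interaction between a text query and the image embeddings can be pictured as standard scaled dot-product cross-attention, whose weights over image positions act as an implicit location signal. The following is a minimal single-head sketch of that generic mechanism, not the paper's architecture: it omits learned projections, and the function name `cross_attention_map` and the toy vectors are assumptions for illustration.

```python
import math


def cross_attention_map(query, image_embeddings):
    """Softmax-normalised attention of one text query over image positions.

    query: list of d floats; image_embeddings: list of d-dimensional lists,
    one per flattened image position. Returns one weight per position.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, emb)) / math.sqrt(d)
              for emb in image_embeddings]
    peak = max(scores)                       # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: the second position aligns best with the query.
weights = cross_attention_map([1.0, 0.0], [[0.0, 1.0], [4.0, 0.0], [0.0, -1.0]])
print(weights.index(max(weights)))
```

Reshaping the returned weights to the feature-map grid gives the kind of attention activation map that the later refinement stage is described as using.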

Paper Structure (17 sections, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Different annotation styles for text spotting. The blue points and lines are the position annotations for the text. The transcription annotation "Hills" is displayed in the top left corner of each image.
  • Figure 2: Overview of the proposed EchoSpot model architecture. The visual and contextual embeddings are first extracted by a backbone network. The embeddings are then decoded to focus on the relevant text regions using a query-based cross-attention module. Next, the refine-stage text query, the mask generated by the attention activation map, and image embeddings are decoded together to obtain a refined text position.
  • Figure 3: Qualitative results. Images are selected from ICDAR 2013 (first column), SCUT-CTW1500 (second column), Total-Text (third column), and ICDAR 2015 (fourth column). The first row shows single-point visualizations, while the second row shows mask visualizations. As the figure shows, our method is robust to various text types, including long text, large text, small text, curved text, perspective text, and fuzzy text.
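The captions above refer to a mask generated from the attention activation map and to single-point visualizations. One simple way to obtain both from an attention map is sketched below, purely as an illustration: the relative threshold, the attention-weighted centroid, and the names `attention_to_mask` and `mask_to_point` are assumptions, not details taken from the paper.

```python
def attention_to_mask(attn, rel_threshold=0.5):
    """Binarise a 2-D attention map at a fraction of its peak value."""
    peak = max(max(row) for row in attn)
    if peak <= 0:
        return [[0 for _ in row] for row in attn]
    cutoff = rel_threshold * peak
    return [[1 if v >= cutoff else 0 for v in row] for row in attn]


def mask_to_point(attn, mask):
    """Attention-weighted centroid (x, y) of the masked region, or None."""
    total = x_sum = y_sum = 0.0
    for y, (attn_row, mask_row) in enumerate(zip(attn, mask)):
        for x, (v, m) in enumerate(zip(attn_row, mask_row)):
            if m:
                total += v
                x_sum += v * x
                y_sum += v * y
    if total == 0:
        return None
    return (x_sum / total, y_sum / total)


attn = [
    [0.0, 0.125, 0.0],
    [0.125, 0.5, 0.5],
    [0.0, 0.125, 0.0],
]
mask = attention_to_mask(attn)
print(mask, mask_to_point(attn, mask))
```

The mask gives a coarse region and the centroid gives a single representative point, mirroring the two visualization styles in Figure 3.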