MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling

Liang Yin, Xudong Xie, Zhang Li, Xiang Bai, Yuliang Liu

TL;DR

MSTAR introduces a box-free framework for multi-query scene text retrieval, addressing annotation costs and the need to support diverse query types. It combines Progressive Vision Embedding to expose fine-grained text, style-aware instructions to harmonize word/phrase/semantic queries, and a Multi-Instance Matching module to align vision and text representations without bounding boxes. The authors also present MQTR, a comprehensive benchmark with four query types over 16k images to evaluate multi-query capabilities. Through extensive experiments on seven public datasets and MQTR, MSTAR achieves competitive results with state-of-the-art box-based methods while significantly reducing annotation requirements, and demonstrates strong multi-query retrieval performance with re-ranking boosts.

Abstract

Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR marginally surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.
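The abstract reports results in MAP (mean average precision). As a reading aid, the following is a minimal sketch of how MAP is commonly computed for a retrieval task: each text query yields a ranked list of images, average precision is computed per query, and the mean is taken over queries. The function names, image IDs, and toy rankings are hypothetical, and whether the paper's evaluation protocol matches this definition in every detail is an assumption.

```python
from typing import Dict, List, Set


def average_precision(ranked_ids: List[str], relevant: Set[str]) -> float:
    """AP for one query: mean of precision@k over the ranks k that hold a relevant image."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], ground_truth: Dict[str, Set[str]]
) -> float:
    """Mean of per-query AP over all queries in the ground truth."""
    aps = [average_precision(rankings[q], ground_truth[q]) for q in ground_truth]
    return sum(aps) / len(aps) if aps else 0.0


# Toy data (hypothetical): two word queries over three images.
rankings = {
    "coffee": ["img3", "img1", "img2"],
    "hotel": ["img2", "img3", "img1"],
}
ground_truth = {
    "coffee": {"img3", "img2"},
    "hotel": {"img1"},
}
score = mean_average_precision(rankings, ground_truth)
print(f"MAP = {score:.4f}")
```

In this toy case the "coffee" query has relevant images at ranks 1 and 3, and the "hotel" query has its only relevant image at rank 3, so the second query pulls the mean down.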

Paper Structure

This paper contains 32 sections, 5 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: (a) MSTAR achieves scene text retrieval without the aid of box annotations. (b) Image-text matching experiments with a VLM [li2023blip]. Fine-grained text instances such as "welcome to beautiful" and "old florida" in the image receive lower matching scores. When the salient text regions that receive higher scores are manually covered, the model can adaptively recognize the detailed text.
  • Figure 2: Overview of MSTAR. MSTAR is built upon four key components: a vision encoder $\phi$, the Progressive Vision Embedding (PVE), the multi-modal encoder $\psi$, and the multi-instance matching module (MIM). PVE incorporates image features $f_t$ and the mask $M_t$ derived from the cross-attention map $C_t$, progressively shifting attention from salient features to fine-grained regions.
  • Figure 3: Visualization of text localization by MSTAR. The panels show localization for (a) semantic, (b) phrase, and (c) combined queries, as well as (d) curved and (e) dense word instances.
  • Figure 4: Qualitative analysis of how VLMs process non-salient text instances, as introduced in Sec. \ref{sec:intro}: (a) BLIP-ViT-Large-384, (b) SigLIP-ViT-Base-512, and (c) SigLIP-ViT-Large-384.
  • Figure 5: Statistical analysis of the MQTR benchmark.
  • ...and 1 more figures
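The captions for Figures 1 and 2 describe attention shifting from salient text to fine-grained regions once the salient regions are masked. The toy sketch below illustrates only that general idea on a flat list of patch scores: at each step it normalizes attention over the uncovered patches, records the map, and then covers the patches that received above-average attention. It is not the authors' PVE implementation; the function name `recycle_attention`, the above-average masking rule, and the example logits are all assumptions made for illustration.

```python
import math
from typing import List


def recycle_attention(logits: List[float], steps: int) -> List[List[float]]:
    """Return one attention map per step, masking salient patches between steps.

    Assumption: a patch counts as "salient" when its attention is at least the
    mean attention over the currently uncovered patches.
    """
    covered = [False] * len(logits)
    maps: List[List[float]] = []
    for _ in range(steps):
        active = [i for i, is_covered in enumerate(covered) if not is_covered]
        if not active:
            break
        # Softmax restricted to uncovered patches; covered patches get zero weight.
        top = max(logits[i] for i in active)
        exps = [0.0 if covered[i] else math.exp(logits[i] - top) for i in range(len(logits))]
        total = sum(exps)
        attn = [e / total for e in exps]
        maps.append(attn)
        # Attention sums to 1 over the active patches, so the mean is 1 / len(active).
        mean_attn = 1.0 / len(active)
        for i in active:
            if attn[i] >= mean_attn:
                covered[i] = True
    return maps


# Hypothetical patch scores: two salient patches, three faint ones.
patch_logits = [4.0, 3.5, 1.0, 0.5, 0.0]
attention_maps = recycle_attention(patch_logits, steps=2)
for step, attn in enumerate(attention_maps, start=1):
    print(step, [round(a, 3) for a in attn])
```

In the first step nearly all attention falls on the two high-scoring patches; after they are covered, the second step redistributes attention over the remaining faint patches, which is the qualitative behavior the captions describe.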