Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval

Gangyan Zeng; Yuan Zhang; Jin Wei; Dongbao Yang; Peng Zhang; Yiwen Gao; Xugong Qin; Yu Zhou

TL;DR

This work addresses OCR-free scene text retrieval with CLIP and identifies two core challenges: a limited text perceptual scale and entangled visual-semantic concepts. It proposes FDP, a three-stage framework that focuses CLIP on text regions, distinguishes content words from function words in the query, and applies semantic-aware prompting, with distracted queries generated during training to help separate similar words. Experiments on IIIT-STR, SVT, TotalText, and a new PSTR benchmark show a strong speed-accuracy balance relative to CLIP baselines and existing STR methods; on IIIT-STR, FDP surpasses the state-of-the-art model by 4.37% while running 4 times faster. Further experiments in phrase-level and attribute-aware settings, together with ablations, indicate that FDP handles diverse forms of query text.

Abstract

Scene text retrieval aims to find all images containing the query text from an image gallery. Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes, resulting in inefficient and inflexible retrieval. In contrast, this work explores the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval. Through empirical analysis, we observe that the main challenges of CLIP as a text retriever are: 1) limited text perceptual scale, and 2) entangled visual-semantic concepts. To this end, a novel model termed FDP (Focus, Distinguish, and Prompt) is developed. FDP first focuses on scene text by shifting the attention to the text area and probing the hidden text knowledge. It then divides the query text into content words and function words for processing, using a semantic-aware prompting scheme and a distracted queries assistance module. Extensive experiments show that FDP significantly enhances the inference speed while achieving better or competitive retrieval accuracy compared to existing methods. Notably, on the IIIT-STR benchmark, FDP surpasses the state-of-the-art model by 4.37% while running 4 times faster. Furthermore, additional experiments under phrase-level and attribute-aware scene text retrieval settings validate FDP's particular advantages in handling diverse forms of query text. The source code will be publicly available at https://github.com/Gyann-z/FDP.
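
To make the retrieval setting and the accuracy metric concrete, the following is a minimal sketch of gallery ranking and average precision (the per-query quantity that mAP averages). It is not the FDP implementation: the two-dimensional vectors are toy stand-ins for image and query embeddings, and the names `cosine`, `rank_gallery`, and `average_precision` are hypothetical helpers introduced only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_gallery(query_vec, gallery):
    """Return image ids sorted by descending similarity to the query."""
    scored = [(cosine(query_vec, vec), image_id) for image_id, vec in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored]

def average_precision(ranking, relevant):
    """Mean of precision@k taken at each rank k where a relevant image appears."""
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranking, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Toy gallery: assumed embeddings, not outputs of any real model.
gallery = {
    "img_a": [1.0, 0.0],
    "img_b": [0.0, 1.0],
    "img_c": [1.0, 1.0],
}
query_vec = [1.0, 0.0]

ranking = rank_gallery(query_vec, gallery)
# Assume img_a and img_b are the images that actually contain the query text.
ap = average_precision(ranking, {"img_a", "img_b"})
```

Averaging `ap` over all queries in a benchmark gives the mAP score. In an OCR-free retriever of the kind described here, the ranking step needs only one similarity computation per image, with no text detection or recognition stage.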

Paper Structure (16 sections, 6 equations, 7 figures, 5 tables)

Introduction
Related Work
Scene Text Retrieval
Exploring CLIP's OCR Capabilities
FDP Method
Focus
Distinguish
Prompt
Optimization
Experiments
Datasets
Implementation Details
Comparison with Existing Methods
Ablation Study
Extending to More Retrieval Settings
...and 1 more sections

Figures (7)

Figure 1: Illustration of the trade-off between retrieval accuracy (mAP scores) and inference speed (FPS) on the IIIT-STR benchmark. Our proposed FDP method achieves a better balance than previous methods.
Figure 2: Illustration of scene text retrieval in (a) phrase-level and (b) attribute-aware settings. Unlike conventional STR models that rely on a local retrieval mechanism, FDP is more flexible in handling diverse forms of query text.
Figure 3: Overview of the proposed FDP model. It consists of three main parts: 1) Focus: Two main modules of dynamic attention shift and text knowledge probing are presented to highlight scene text information. 2) Distinguish: The query text is categorized into content words and function words via unsupervised clustering. 3) Prompt: The retrieval process is finally achieved by a semantic-aware prompting scheme, and meanwhile distracted queries are generated during training to assist in identifying similar words.
Figure 4: Details of the dynamic attention shift module.
Figure 5: Illustration of the effect caused by visual-semantic entanglement. (a) The t-SNE visualization of high-frequency scene text's CLIP language embeddings. (b) The comparison of the retrieval accuracy of three frozen CLIP models on content words and function words.
...and 2 more figures
