WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting

Jingjing Wu, Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Fanglin Chen, Guangming Lu, Wenjie Pei

TL;DR

WeCromCL tackles transcription-only supervised text spotting by reframing detection as weakly supervised atomistic cross-modality learning that produces an activation map $M$ and region-text correlation $c'$, locating the corresponding image region for each transcription without text boundaries. The method operates in two stages: first, WeCromCL detects anchor points via atomistic cross-modality learning; second, a single-point text spotter is trained with those pseudo location labels. Key contributions include character-wise text encoding, soft activation maps, negative-sample mining for cross-modality contrastive learning, and anchor-guided spotting that achieves competitive results across four benchmarks with reduced annotation cost. The approach enables effective transcription localization and spotting, with potential for generating pseudo labels to boost fully supervised OCR systems and for cross-task retrieval.

Abstract

Transcription-only Supervised Text Spotting aims to learn text spotters relying only on transcriptions but no text boundaries for supervision, thus eliminating expensive boundary annotation. The crux of this task lies in locating each transcription in scene text images without location annotations. In this work, we formulate this challenging problem as a Weakly Supervised Cross-modality Contrastive Learning problem, and design a simple yet effective model dubbed WeCromCL that is able to detect each transcription in a scene image in a weakly supervised manner. Unlike typical methods for cross-modality contrastive learning that focus on modeling the holistic semantic correlation between an entire image and a text description, our WeCromCL conducts atomistic contrastive learning to model the character-wise appearance consistency between a text transcription and its correlated region in a scene image to detect an anchor point for the transcription in a weakly supervised manner. The detected anchor points by WeCromCL are further used as pseudo location labels to guide the learning of text spotting. Extensive experiments on four challenging benchmarks demonstrate the superior performance of our model over other methods. Code will be released.
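
To make the idea of locating a transcription without boundary labels more concrete, the following is a minimal NumPy sketch of how a soft activation map $M$, an anchor point, and a region-text correlation $c'$ could be computed from an image feature map and a single transcription embedding. It is an illustration under stated assumptions, not the authors' implementation: the function names, the cosine-similarity scoring, the softmax normalisation, and the `temperature` parameter are all hypothetical choices, and the image and text encoders are taken as given.

```python
import numpy as np


def soft_activation_map(image_feats, text_emb, temperature=0.1):
    """Score every spatial location against one transcription embedding.

    image_feats: (H, W, D) feature map from some image encoder (assumed).
    text_emb:    (D,) embedding of one transcription (assumed).
    Returns an (H, W) map of non-negative weights that sum to 1.
    """
    h, w, d = image_feats.shape
    flat = image_feats.reshape(h * w, d)
    # Cosine similarity between each location and the text embedding.
    flat_n = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    text_n = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    logits = flat_n @ text_n / temperature
    # Numerically stable softmax over all H*W locations.
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights.reshape(h, w)


def anchor_and_correlation(image_feats, text_emb, temperature=0.1):
    """Return (anchor, correlation, activation_map) for one transcription.

    anchor:      (row, col) of the most activated location.
    correlation: cosine similarity between the activation-weighted region
                 feature and the text embedding (a stand-in for c').
    """
    act = soft_activation_map(image_feats, text_emb, temperature)
    row, col = np.unravel_index(np.argmax(act), act.shape)
    # Pool image features with the activation map as soft attention weights.
    region = (act[..., None] * image_feats).sum(axis=(0, 1))
    denom = np.linalg.norm(region) * np.linalg.norm(text_emb) + 1e-8
    correlation = float(region @ text_emb / denom)
    return (int(row), int(col)), correlation, act
```

In a contrastive setup, a correlation score of this kind would be pushed up for transcriptions that appear in the image and down for negative transcriptions, while the anchor would serve as the pseudo location label for the second stage; the details of the loss and of negative-sample mining are not specified in this section and are left out of the sketch.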

Paper Structure (17 sections, 9 equations, 7 figures, 7 tables)

This paper contains 17 sections, 9 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison between our WeCromCL and TCM [Yu2023TurningAC], oCLIP [Xue2022LanguageMA], and VLPT [song2022vision]. (a) TCM distinguishes text regions from non-text regions in a scene image in a fully supervised manner using text polygon annotations. (b) Both oCLIP and VLPT perform holistic contrastive learning between the entire scene image and the text, fully supervised with respect to the contrastive pairs, to learn an effective image encoder for downstream OCR tasks; both also rely on an auxiliary task for optimization, namely predicting masked characters (oCLIP) or masked words (VLPT). (c) Our WeCromCL conducts atomistic contrastive learning to model the appearance consistency between a text transcription and its correlated region in the scene image, enabling transcription-wise detection in a weakly supervised manner without text location annotations.
  • Figure 1: Visualization of activation maps learned by WeCromCL. Our WeCromCL can handle various complex cases, such as text with artistic fonts, curved text, long text, and small text. Given a text transcription, WeCromCL generates a corresponding activation map in which the most highly activated region is identified as the anchor point for this transcription.
  • Figure 2: Architecture of our proposed transcription-only supervised text spotter. Our method consists of two stages: 1) detecting the anchor point for each text instance as pseudo location label by WeCromCL; 2) conducting text spotting under the supervision of obtained pseudo location labels.
  • Figure 2: Visualization of text spotting results on four benchmarks: (a) ICDAR 2013, (b) ICDAR 2015, (c) Total-Text and (d) CTW1500. The green '+' represents the estimated anchor point for each text instance. The blue dots denote the sampled points.
  • Figure 3: Visual comparison of corresponding attention maps in the decoder of (a) WeCromCL + SPTS and (b) NPTS.
  • ...and 2 more figures