Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition

Yifei Zhang, Chang Liu, Jin Wei, Xiaomeng Yang, Yu Zhou, Can Ma, Xiangyang Ji

TL;DR

Linguistics-aware Masked Image Modeling (LMIM) targets the dual visual and linguistic nature of scene text images by integrating linguistic cues into self-supervised masked image modeling. The approach uses a dual-branch architecture with a linguistics guidance pathway that processes a guidance view (an image containing identical text but a different visual appearance) to extract vision-independent linguistic features. Training jointly optimizes a reconstruction loss $L_{recon}$ and a linguistic alignment loss $L_{align}$, combined as $L = L_{recon} + L_{align}$. The reported results show state-of-the-art performance on English and Chinese STR benchmarks, with strong gains when pre-training on large unlabeled data such as Union14M-U and 11M Chinese text images, and attention visualizations give qualitative evidence of linguistics-aware representations. The work highlights the practical benefit of explicitly incorporating linguistic information into self-supervised visual modeling: better global context understanding and more robust scene text recognition.
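As a rough illustration of the combined objective $L = L_{recon} + L_{align}$, the sketch below adds a reconstruction term computed on masked patches to an alignment term computed between linguistic features of the two views. The summary does not give the exact form of either loss, so the use of mean squared error for both terms, and all function and variable names (`recon_loss`, `align_loss`, `total_loss`, `ling_masked`, `ling_guidance`), are assumptions made for illustration only, not the authors' implementation.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def recon_loss(pred_patches, target_patches, mask):
    """Average MSE over masked patches only.

    Assumption: an MAE-style pixel target; the paper's actual
    reconstruction target may differ.
    """
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    if not masked:
        return 0.0
    per_patch = [mse(pred_patches[i], target_patches[i]) for i in masked]
    return sum(per_patch) / len(masked)


def align_loss(ling_masked, ling_guidance):
    """Assumed form: MSE between the linguistic features extracted from
    the masked view and from the guidance view (same text, different look)."""
    return mse(ling_masked, ling_guidance)


def total_loss(pred, target, mask, ling_masked, ling_guidance):
    """Unweighted sum L = L_recon + L_align, as stated in the summary."""
    return recon_loss(pred, target, mask) + align_loss(ling_masked, ling_guidance)


# Toy example: three patches of two pixels each; patches 1 and 2 are masked.
pred = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
target = [[0.0, 0.0], [1.0, 3.0], [2.0, 2.0]]
mask = [False, True, True]
loss = total_loss(pred, target, mask, [1.0, 0.0], [0.0, 0.0])
print(loss)
```

In this toy setup the unmasked patch contributes nothing to the reconstruction term, which mirrors the usual masked image modeling convention of scoring only the hidden regions.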

Abstract

Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as crucial supplements for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information.

Paper Structure

This paper contains 15 sections, 4 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Illustration of our motivation. (a) Existing studies demonstrate that both visual and linguistic information are crucial for STR, as linguistic information can complement visual features. (b) Current self-supervised STR approaches, such as sequence contrastive learning (SeqCLR) and masked image modeling (MIM), primarily focus on local region alignment or rely on local visual information for reconstruction, often neglecting the integration of visual and linguistic information at a global level. To address this, our LMIM method channels linguistic information into the decoding process of MIM. (c) Attention maps reveal that SeqCLR lacks character structure information, while MIM emphasizes local regions. Our LMIM effectively captures the global context based on vision and linguistics. The red box in the input image indicates the query.
  • Figure 2: Overview of our framework. Based on the dual-branch structure, the reconstruction loss and alignment loss are jointly optimized.
  • Figure 3: Visualization of attention maps. The red box in the input image refers to the query.