Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition

Tiancheng Lin, Jinglei Zhang, Yi Xu, Kai Chen, Rui Zhang, Chang-Wen Chen

TL;DR

A unified framework of Relational Contrastive Learning and Masked Image Modeling for STR (RCMSTR), which explicitly models the enriched textual relations, and introduces a novel decoupling design aimed at mitigating the impact of masked images on contrastive learning.

Abstract

Context-aware methods have achieved remarkable advancements in supervised scene text recognition (STR) by leveraging semantic priors from words. Considering the heterogeneity of text and background in STR, we propose that such contextual priors can be reinterpreted as the relations between textual elements, serving as effective self-supervised labels for representation learning. However, textual relations are restricted to the finite size of the dataset due to lexical dependencies, which causes over-fitting and thus compromises the representation quality. To address this, our work introduces a unified framework of Relational Contrastive Learning (RCL) and Masked Image Modeling (MIM) for STR (RCMSTR), which explicitly models the enriched textual relations. For the RCL branch, we first introduce the relational rearrangement module to cultivate new relations on the fly. Based on this, we further conduct relational contrastive learning to model the intra- and inter-hierarchical relations for frames, sub-words and words. On the other hand, MIM can naturally boost the context information via masking, where we find that the block masking strategy is more effective for STR. For the effective integration of RCL and MIM, we also introduce a novel decoupling design aimed at mitigating the impact of masked images on contrastive learning. Additionally, to enhance the compatibility of MIM with CNNs, we propose the adoption of sparse convolutions and directly sharing the weights with dense convolutions in training. The proposed RCMSTR demonstrates superior performance in various evaluation protocols for different STR-related downstream tasks, outperforming the existing state-of-the-art self-supervised STR techniques. Ablation studies and qualitative experimental results further validate the effectiveness of our method. The code and pre-trained models will be available at https://github.com/ThunderVVV/RCMSTR.
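
The abstract contrasts block masking with other masking strategies for MIM. As a rough illustration only, the sketch below builds a random patch mask and a block mask over a grid of patches. It assumes that "block" means one contiguous run of patch columns spanning the full image height; the function names, grid size and mask ratio are hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def random_patch_mask(num_rows, num_cols, mask_ratio, rng):
    """Mask individual patches chosen uniformly at random (no structure)."""
    num_patches = num_rows * num_cols
    num_masked = int(round(mask_ratio * num_patches))
    flat = np.zeros(num_patches, dtype=bool)
    flat[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return flat.reshape(num_rows, num_cols)

def horizontal_block_mask(num_rows, num_cols, mask_ratio, rng):
    """Mask one contiguous run of patch columns across the full height.

    Assumption: the block runs along the horizontal (reading) axis, so it
    tends to hide whole characters rather than scattered fragments.
    """
    width = int(round(mask_ratio * num_cols))
    start = int(rng.integers(0, num_cols - width + 1))
    mask = np.zeros((num_rows, num_cols), dtype=bool)
    mask[:, start:start + width] = True
    return mask

rng = np.random.default_rng(0)
# A text image split into a hypothetical 4 x 16 grid of patches, half masked.
rand_mask = random_patch_mask(4, 16, 0.5, rng)
block_mask = horizontal_block_mask(4, 16, 0.5, rng)
```

Both masks hide the same number of patches. Under the assumption above, the block mask removes a contiguous stretch of the text line, so reconstruction has to rely on the surrounding characters rather than on neighbouring pixels of the same character.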

Paper Structure

This paper contains 25 sections, 9 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: We propose RCMSTR, a unified SSL method for text images to fully utilize textual relations. RCMSTR learns richer textual relations in contrastive learning via rearrangement, hierarchy and interaction. Besides, the MIM simultaneously conducts patch and character-level reconstruction to fully learn the local and global relations of text. Finally, RCMSTR is integrated as a unified SSL framework characterized by a decoupled design and effective compatibility with both CNN and ViT.
  • Figure 2: Block diagram. Each image in a batch is augmented and processed by the Relational MIM and CL components. In MIM, the image undergoes masking based on a specific strategy to facilitate local and global relational modeling, followed by a prediction head that reconstructs the masked regions. In CL, the image is augmented twice and then fed separately into the online branch (top) and the momentum branch (bottom) of the encoder and projector to create pairs of representation maps. In the module of enriching relations, we randomly permute the image patches and reverse the permutation on their features. Next, for the hierarchical contrastive learning of these representations, we apply three predictors that transform them into frames, subwords and words, respectively. Finally, we apply the relational contrastive loss on the corresponding intra- and inter-hierarchical positive pairs.
  • Figure 3: Random masking of patches.
  • Figure 4: Horizontal block masking.
  • Figure 5: t-SNE results.
  • ...and 3 more figures
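
The Figure 2 caption states that image patches are randomly permuted and that the permutation is then reversed on their features. A minimal sketch of that bookkeeping follows, assuming the image is cut into equal-width vertical strips and using a toy per-patch mean as a stand-in encoder. The helper names and the encoder are hypothetical; the paper's actual encoder and patch layout are not specified here.

```python
import numpy as np

def permute_patches(image, num_patches, rng):
    """Cut the image into equal-width vertical strips and shuffle them."""
    patches = np.split(image, num_patches, axis=1)  # width must divide evenly
    perm = rng.permutation(num_patches)
    shuffled = np.concatenate([patches[i] for i in perm], axis=1)
    return shuffled, perm

def toy_encoder(image, num_patches):
    """Stand-in encoder (assumption): one scalar feature per patch, its mean."""
    patches = np.split(image, num_patches, axis=1)
    return np.array([p.mean() for p in patches])

def restore_feature_order(features, perm):
    """Undo the permutation so feature k again corresponds to original patch k."""
    inverse = np.argsort(perm)  # inverse[k] = shuffled position of original patch k
    return features[inverse]

rng = np.random.default_rng(0)
image = np.arange(4 * 32, dtype=float).reshape(4, 32)  # toy 4 x 32 "text image"
shuffled, perm = permute_patches(image, num_patches=8, rng=rng)
restored = restore_feature_order(toy_encoder(shuffled, 8), perm)
reference = toy_encoder(image, 8)
```

With this toy encoder each patch is encoded independently, so `restored` equals `reference` exactly. A real encoder mixes context across patches, so the restored features would differ from those of the unshuffled image, which is presumably how the rearrangement exposes the model to new relations between textual elements.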