Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition

Zuan Gao, Yuxin Wang, Yadong Qu, Boqiang Zhang, Zixiao Wang, Jianjun Xu, Hongtao Xie

TL;DR

This work addresses the gap in self-supervised scene text recognition where linguistic information is underlearned by existing methods. It introduces Symmetric Superimposition Modeling (SSM), a masking-free pretraining framework that uses symmetric overlays of an image with an inverted view to jointly reconstruct direction-specific pixel and feature signals in a Siamese online/target setup. The approach includes a pixel-level reconstruction pathway with direction-conditioned prompts and a feature-level reconstruction pathway with discriminative and dense losses, yielding $L = L_{pix} + \alpha (L_{dis} + L_{den})$ and enabling robust multilingual generalization. Empirically, SSM achieves substantial gains on Union14M and competitive performance on standard STR benchmarks, with strong improvements across languages and downstream tasks like segmentation and super-resolution, demonstrating the practical impact of integrating linguistic learning into visual-space self-supervision.
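As a small illustration of how the combined objective $L = L_{pix} + \alpha (L_{dis} + L_{den})$ weights its terms, the sketch below evaluates it for made-up scalar values. The function name `total_loss`, the example loss values, and the choice of `alpha` are hypothetical placeholders; the summary above does not state which value of $\alpha$ the authors use.

```python
def total_loss(l_pix: float, l_dis: float, l_den: float, alpha: float) -> float:
    """Combine the three terms as L = L_pix + alpha * (L_dis + L_den).

    l_pix: pixel-level reconstruction loss
    l_dis: feature-level discriminative loss
    l_den: feature-level dense loss
    alpha: weight on the feature-level pathway (value here is an assumption)
    """
    return l_pix + alpha * (l_dis + l_den)


# Hypothetical values, chosen only to make the arithmetic easy to follow.
example = total_loss(l_pix=0.5, l_dis=0.2, l_den=0.3, alpha=0.1)
```

With `alpha = 0` only the pixel-level pathway contributes, so $\alpha$ controls how strongly the two feature-level losses influence pre-training.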

Abstract

In text recognition, self-supervised pre-training has emerged as a good solution to reduce dependence on expansive annotated real data. Previous studies primarily focus on local visual representation by leveraging masked image modeling or sequence contrastive learning. However, they omit modeling the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct the direction-specific pixel and feature signals from the symmetrically superimposed input. Specifically, we add the original image to its inverted views to create the symmetrically superimposed inputs. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the features of the same original and inverted images under different augmentations to model the semantic-level linguistic context and local character discrimination. By design, the superimposition disrupts character shapes and linguistic rules. Consequently, the dual-level reconstruction facilitates understanding character shapes and linguistic information from the perspective of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and a new state-of-the-art average word accuracy of 86.6% on the Union14M benchmarks. The code is available at https://github.com/FaltingsA/SSM.
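The construction of a symmetrically superimposed input can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `inverted_view` and `superimpose`, the equal-weight blend (`weight=0.5`), and the toy array are all assumptions. The abstract only states that the original image is added to its inverted view, and the figure captions name the three inversions as horizontal flip, vertical flip, and 180-degree rotation.

```python
import numpy as np


def inverted_view(img: np.ndarray, mode: str) -> np.ndarray:
    """Return an inverted view of an image shaped (H, W) or (H, W, C)."""
    if mode == "hflip":    # horizontal flip: reverse the width axis
        return img[:, ::-1]
    if mode == "vflip":    # vertical flip: reverse the height axis
        return img[::-1, :]
    if mode == "rotate":   # 180-degree rotation: reverse both spatial axes
        return img[::-1, ::-1]
    raise ValueError(f"unknown mode: {mode}")


def superimpose(img: np.ndarray, mode: str, weight: float = 0.5) -> np.ndarray:
    """Blend an image with its inverted view (blend weight is an assumption)."""
    return weight * img + (1.0 - weight) * inverted_view(img, mode)


# Toy 2x3 "image" standing in for a text crop.
toy = np.arange(6, dtype=np.float64).reshape(2, 3)
hs = superimpose(toy, "hflip")   # horizontally superimposed input
vs = superimpose(toy, "vflip")   # vertically superimposed input
rs = superimpose(toy, "rotate")  # rotation-superimposed input
```

With an equal-weight blend, each superimposed input is unchanged by the inversion that produced it (for example, flipping `hs` horizontally returns `hs`). The input therefore carries no cue about which direction is the original, so the model has to separate the two overlaid views during reconstruction.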

Paper Structure (29 sections, 6 equations, 16 figures, 12 tables)

This paper contains 29 sections, 6 equations, 16 figures, 12 tables.

Figures (16)

  • Figure 1: Comparison of mainstream self-supervised text recognition methods with our SSM. Rotate, VFlip, and HFlip Views stand for the symmetrically augmented images created through 180-degree rotation, vertical flipping, and horizontal flipping, respectively. HS, VS, and RS views represent images formed by superimposing the HFlip, VFlip, and Rotate View, respectively, with the Origin View.
  • Figure 2: The pre-training framework of SSM. The blue arrow and green arrow stand for the workflow of the online branch and target branch respectively. Origin View: original image, HFlip View: horizontally flipped image, VFlip View: vertically flipped image, Rotate View: 180-degree rotated image. $T_P$ and $T_n$ correspond to the original and the reversed text direction, respectively.
  • Figure 3: Reconstruction Visualization. GT: the original image. HS/VS/RS: horizontal/vertical/rotated superimposed input. Pre.: the pixel prediction of GT. GT-H/V/R: the inverted view of the GT (HFlip, VFlip, and 180-degree rotation view, respectively). Pre.-H/V/R: the pixel prediction of GT-H/V/R.
  • Figure 4: Comparison of feature representation evaluation.
  • Figure 5: Fine-tuning on ARD with different ratios.
  • ...and 11 more figures