Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing

Yadong Qu, Yuxin Wang, Bangbang Zhou, Zixiao Wang, Hongtao Xie, Yongdong Zhang

TL;DR

A new Character Unidirectional Alignment Loss is proposed to correct the derivation error in the previous character contrastive loss and unify the representation of the same characters in all samples by aligning the character features in the student model with the reference features in the teacher model.

Abstract

Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially artistic and severely distorted characters. The limitation lies in the insufficient exploration of character morphologies, including the monotony of widely used synthetic training data and the sensitivity of the model to character morphologies. To address these issues, inspired by the human learning process of viewing and summarizing, we facilitate the contrastive learning-based STR framework in a self-motivated manner by leveraging synthetic and real unlabeled data without any human cost. In the viewing process, to compensate for the simplicity of synthetic data and enrich character morphology diversity, we propose an Online Generation Strategy to generate background-free samples with diverse character styles. By excluding background noise distractions, the model is encouraged to focus on character morphology and to generalize to complex samples even when trained only on simple synthetic data. To boost the summarizing process, we theoretically demonstrate the derivation error in the previous character contrastive loss, which mistakenly causes sparsity in the intra-class distribution and exacerbates ambiguity on challenging samples. Therefore, a new Character Unidirectional Alignment Loss is proposed to correct this error and unify the representation of the same characters in all samples by aligning the character features in the student model with the reference features in the teacher model. Extensive experimental results show that our method achieves state-of-the-art (SOTA) performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark, respectively). Code will be available at https://github.com/qqqyd/ViSu.
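
As a rough illustration of the alignment idea described above, the following sketch pulls each student character feature toward a per-class reference feature computed from the teacher. It is not the paper's formulation: the function names, the use of a per-class mean as the reference, and the cosine-distance objective are all assumptions made for illustration. In a real training setup the references would be treated as constants (no gradient flowing into the teacher), which is what makes the alignment one-directional.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_references(teacher_feats, labels):
    """Hypothetical helper: mean teacher feature per character class.

    Assumption: the reference is a simple per-class mean; the paper may
    construct its reference features differently.
    """
    sums, counts = {}, {}
    for feat, label in zip(teacher_feats, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + a for s, a in zip(sums[label], feat)]
        counts[label] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def alignment_loss(student_feats, labels, references):
    """Mean cosine distance between student features and class references.

    Assumption: cosine distance stands in for the actual loss. Only the
    student features are meant to be updated; references are fixed inputs.
    """
    total = 0.0
    for feat, label in zip(student_feats, labels):
        total += 1.0 - cosine(feat, references[label])
    return total / len(student_feats)


# Toy example: two character classes with 2-D features.
labels = ["a", "a", "b"]
teacher = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
refs = class_references(teacher, labels)

aligned = [[2.0, 0.0], [3.0, 0.0], [0.0, 5.0]]      # same directions as refs
misaligned = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]   # orthogonal to refs

loss_aligned = alignment_loss(aligned, labels, refs)
loss_misaligned = alignment_loss(misaligned, labels, refs)
```

In this toy setup the loss is smallest when every student feature points in the same direction as its class reference, so all instances of a character are drawn toward one shared representation rather than being spread apart within the class.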

Paper Structure

This paper contains 27 sections, 20 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: (a) shows some images from synthetic datasets MJSynth and SynthText. (b) and (c) show several challenging test images. (d) and (e) display the visualization of character feature distribution.
  • Figure 2: (a) All possible representations of English text images according to character orientation and reading order. (b) The unified representation forms of the word "standard" obtained through Online Generation Strategy. The first row with a red border shows two primary forms, and the second row can be obtained by rotating them 180 degrees.
  • Figure 3: Our framework consists of a student model and a teacher model. $\mathcal{L}_{rec}$, $\mathcal{L}_{ccr}$, and $\mathcal{L}_{cua}$ denote the recognition loss, the character consistency regularization loss, and the character unidirectional alignment loss, respectively. Green and orange stand for labeled and unlabeled data, respectively.
  • Figure 4: (a) shows several challenging examples. The four lines from top to bottom represent the recognition results from ViSu, Baseline, ParSeq, and TRBA-cr. The first row shows examples with multiple directions. The second row displays examples with artistic or distorted characters. (b) and (c) are the visualizations of character features for CC Loss and CUA Loss, respectively.
  • Figure 5: Failure cases. The first line shows the ground truth, and the second line shows the recognition result.