Focus on the Whole Character: Discriminative Character Modeling for Scene Text Recognition

Bangbang Zhou; Yadong Qu; Zixiao Wang; Zicheng Li; Boqiang Zhang; Hongtao Xie

TL;DR

The paper addresses scene text recognition (STR) under severe distortion, where characters exhibit large intra-class variance and small inter-class variance. It introduces the Character Features Enriched model (CFE), which combines two components: a Character-Aware Constraint Encoder (CACE), which applies a decay matrix to attention so that each token captures local character morphology, and an Intra-Inter Consistency Loss (I^2CL), which learns a long-term memory unit for each character class to enforce intra-class compactness and inter-class separability. Trained on synthetic data, CFE reports state-of-the-art results on common benchmarks (94.1% accuracy) and on the Union14M-Benchmark (61.6% average accuracy) while remaining parameter-efficient, and ablations and visualizations illustrate the contribution of each component. By pairing local-pattern encoding with global feature-distribution modeling, the method improves recognition of challenging text such as curved or artistic text.

Abstract

Recently, scene text recognition (STR) models have shown significant performance improvements. However, existing models still have difficulty recognizing challenging texts that involve factors such as severe distortion and perspective. These challenging texts mainly cause two problems: (1) large intra-class variance and (2) small inter-class variance. An extremely distorted character may differ markedly in appearance from other characters of the same category, while the variance between characters from different classes is relatively small. To address these issues, we propose a novel method that enriches character features to enhance the discriminability of characters. Firstly, we propose the Character-Aware Constraint Encoder (CACE), composed of multiple stacked blocks. CACE introduces a decay matrix in each block to explicitly guide the attention region for each token. By continuously employing the decay matrix, CACE enables tokens to perceive morphological information at the character level. Secondly, an Intra-Inter Consistency Loss (I^2CL) is introduced to account for intra-class compactness and inter-class separability in the feature space. I^2CL improves the discriminative capability of features by learning a long-term memory unit for each character category. Trained with synthetic data, our model achieves state-of-the-art performance on common benchmarks (94.1% accuracy) and Union14M-Benchmark (61.6% accuracy). Code is available at https://github.com/bang123-box/CFE.
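The decay-guided attention described in the abstract can be illustrated with a minimal sketch. The abstract does not give the form of the decay matrix, so everything concrete here is an assumption: the decay is taken to be `gamma` raised to the Manhattan distance between token positions on the feature grid, it is multiplied into ordinary softmax attention weights, and the result is renormalized. The names `decay_matrix`, `decayed_attention`, and `gamma` are hypothetical and are not taken from the CFE code.

```python
import math

def decay_matrix(height, width, gamma=0.8):
    """Assumed decay: gamma ** Manhattan distance between grid positions.

    Tokens are laid out row by row on a height x width grid.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [[gamma ** (abs(r1 - r2) + abs(c1 - c2)) for (r2, c2) in coords]
            for (r1, c1) in coords]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def decayed_attention(q, k, v, decay):
    """Scaled dot-product attention with weights damped by `decay`.

    Returns (outputs, weights). Renormalizing after damping is a choice
    made for this sketch, not something stated in the source.
    """
    dim = len(q[0])
    weights = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
        damped = [p * decay[i][j] for j, p in enumerate(softmax(scores))]
        total = sum(damped)
        weights.append([w / total for w in damped])
    out = [[sum(w * vj[t] for w, vj in zip(row, v)) for t in range(len(v[0]))]
           for row in weights]
    return out, weights
```

With identical queries and keys the softmax is uniform, so the resulting weights follow the decay alone: each token attends most strongly to itself and progressively less to tokens farther away on the grid, which is the locality bias the abstract attributes to the decay matrix.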

Paper Structure (22 sections, 4 equations, 5 figures, 8 tables)

Introduction
Related Work
Scene Text Recognition
Contrastive Learning in STR
Proposed Method
Pipeline
Character-Aware Constraint Encoder
Intra-Inter Consistency Loss
Training Objective
Experiment
Datasets
Implementation Details
Evaluation Metric
Comparisons with State-of-the-Arts
Ablation Study
...and 7 more sections

Figures (5)

Figure 1: Differences between simple and challenging texts. (a) Simple texts are singular in style and uniform in size. (b) With variations in appearance and size, the character 't' is misrecognized. (c) The similar appearance of characters from different categories leads to wrong recognition. The first line is the label and the second line is the prediction of our baseline model. The incorrectly recognized characters are highlighted in red.
Figure 2: The framework of our CFE. The pipeline is composed of two key components: CACE and $\text{I}^{2}\text{CL}$. CACE explores the local patterns within each character by utilizing the decay matrix. $\text{I}^{2}\text{CL}$ uses a set of learnable long-term memory units to represent the global character feature distribution in the decoding space. CE loss denotes the cross-entropy loss. DW denotes 2x downsampling along the height dimension using a CNN.
Figure 3: Visualization of three different options for generating decay matrix.
Figure 4: Visualization of attention maps in CACE. In the first column, the red point in each image is the query token. The second column shows the attention scores between the red point and all other points computed with the baseline. The third column shows the same attention scores computed with CFE.
Figure 5: Visualization of character feature distribution. The feature points in the red rectangle indicate a mixture distribution. Zoom in for better visualization.
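The idea behind I^2CL, as described in the abstract and the Figure 2 caption (one memory unit per character class, intra-class compactness, inter-class separability), can be sketched in a few lines. This is an illustrative stand-in and not the paper's loss, whose exact formulation is not given in this summary. The sketch assumes cosine similarity as the measure, an intra term that pulls each feature toward its own class memory, and an inter term that penalizes similar memories of different classes. The names `intra_term`, `inter_term`, `consistency_loss`, and `weight` are hypothetical, and the memory units are plain inputs here rather than learned parameters.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def intra_term(features, labels, memory):
    """Mean (1 - cosine) between each feature and its class memory unit.

    Small when features of a class cluster around that class's memory.
    """
    gaps = [1.0 - cosine(f, memory[y]) for f, y in zip(features, labels)]
    return sum(gaps) / len(gaps)

def inter_term(memory):
    """Mean positive cosine similarity over distinct pairs of memory units.

    Small when memory units of different classes point in different directions.
    Requires at least two classes.
    """
    classes = sorted(memory)
    pairs = [(a, b) for i, a in enumerate(classes) for b in classes[i + 1:]]
    sims = [max(0.0, cosine(memory[a], memory[b])) for a, b in pairs]
    return sum(sims) / len(sims)

def consistency_loss(features, labels, memory, weight=1.0):
    """Assumed combination: intra term plus a weighted inter term."""
    return intra_term(features, labels, memory) + weight * inter_term(memory)
```

For two visually similar classes such as 'l' and 'I', orthogonal memory units with features aligned to the correct unit give a loss of zero, while nearly parallel memory units raise the inter term. In a trainable version the memory units would be parameters updated by gradient descent alongside the recognizer.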
