Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition

Mingkun Yang, Biao Yang, Minghui Liao, Yingying Zhu, Xiang Bai

TL;DR

This paper introduces canonical class-aware glyph masks, generated from a standard font, to suppress background and text-style noise and thereby make the extracted features more discriminative. A feature alignment and fusion module then uses this canonical mask guidance to refine the features for robust scene text recognition.

Abstract

Scene text recognition is a rapidly developing field that faces numerous challenges due to the complexity and diversity of scene text, including complex backgrounds, diverse fonts, flexible arrangements, and accidental occlusions. In this paper, we propose a novel approach called Class-Aware Mask-guided feature refinement (CAM) to address these challenges. Our approach introduces canonical class-aware glyph masks generated from a standard font to effectively suppress background and text style noise, thereby enhancing feature discrimination. Additionally, we design a feature alignment and fusion module to incorporate the canonical mask guidance for further feature refinement for text recognition. By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion, ultimately leading to improved recognition performance. We first evaluate CAM on six standard text recognition benchmarks to demonstrate its effectiveness. Furthermore, CAM exhibits superiority over the state-of-the-art method by an average performance gain of 4.1% across six more challenging datasets, despite utilizing a smaller model size. Our study highlights the importance of incorporating canonical mask guidance and aligned feature refinement techniques for robust scene text recognition. The code is available at https://github.com/MelosY/CAM.

Paper Structure

This paper contains 33 sections, 6 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Text images with (a) various layouts, (b) diverse fonts, (c) perspective distortion, (d) cluttered backgrounds, and (e) occlusion.
  • Figure 2: Illustration of our proposed CAM. $\mathbf{F}$ is the visual feature extracted from the backbone. $\mathbf{F}_c$ is the embedding of canonical glyph masks. $\mathbf{F}_r$ is the fused feature of canonical masks and backbone features.
  • Figure 3: Illustration of the discriminative canonical glyph segmentation module, comprising the class-aware segmentation component (left) and a canonical feature generation structure with a "$\cap$-shaped" design.
  • Figure 4: Illustration of the mask-guided feature alignment and fusion module. First, a set of reference points is placed uniformly on the recognition feature maps; their offsets are learned from the concatenation of the recognition features and the canonical features. Next, the deformed keys and values are projected from the features sampled at the deformed points. Finally, regular multi-head attention is employed for fusion.
  • Figure 5: Visualization of the generated canonical class-aware masks in our model without refinement. The two strings near each image represent ground truth and prediction, respectively.
  • ...and 3 more figures
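
The Figure 4 caption outlines three steps: uniform reference points, offset-based sampling of keys and values, and attention-based fusion. The following is a minimal sketch of that flow in plain Python, not the authors' implementation. It makes several simplifying assumptions: the offsets are supplied by hand rather than learned from the concatenated recognition and canonical features, the linear projections for queries, keys, and values are omitted, a single attention head is used, and bilinear interpolation is assumed as the sampling scheme. The function names `bilinear_sample`, `softmax`, and `deformed_attention` are hypothetical.

```python
import math


def bilinear_sample(feat, y, x):
    """Sample feat[H][W][C] at a fractional (y, x), clamping to the map border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return [
        (1 - dy) * (1 - dx) * a + (1 - dy) * dx * b
        + dy * (1 - dx) * c + dy * dx * d
        for a, b, c, d in zip(feat[y0][x0], feat[y0][x1],
                              feat[y1][x0], feat[y1][x1])
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformed_attention(feat, offsets):
    """Single-head attention whose keys/values come from shifted reference points.

    feat:    H x W x C feature map as nested lists.
    offsets: H x W grid of (dy, dx) shifts. Assumption: passed in by hand here;
             the caption says they are learned from concatenated features.
    """
    h, w, c = len(feat), len(feat[0]), len(feat[0][0])
    # Step 1 and 2: one reference point per cell, shifted by its offset,
    # then sampled to obtain the deformed keys/values.
    sampled = [
        bilinear_sample(feat, y + offsets[y][x][0], x + offsets[y][x][1])
        for y in range(h) for x in range(w)
    ]
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            q = feat[y][x]
            # Step 3: scaled dot-product scores against every sampled key.
            scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(c)
                      for k in sampled]
            weights = softmax(scores)
            # Values equal keys because the projections are omitted.
            row.append([sum(wt * v[i] for wt, v in zip(weights, sampled))
                        for i in range(c)])
        out.append(row)
    return out


feat = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.0, 0.0]]]
zero_offsets = [[(0.0, 0.0)] * 2 for _ in range(2)]
fused = deformed_attention(feat, zero_offsets)
```

With zero offsets this reduces to ordinary self-attention over the grid; non-zero offsets let each key and value be drawn from a shifted location, which is the part the caption attributes to the learned alignment.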