Extremely Fine-Grained Visual Classification over Resembling Glyphs in the Wild

Fares Bougourzi, Fadi Dornaika, Chongsheng Zhang

TL;DR

This work tackles extremely fine-grained recognition of resembling glyphs in natural scenes by introducing RCC-FGVC and EL-FGVC benchmarks and a two-stage Siamese framework, CCFG-Net, that learns discriminative features in both Euclidean and angular spaces using supervised contrastive warm-up and joint contrastive/classification training. The method integrates $L_{SCL}$, $L_{LMCL}$, $L_e$, and $L_a$ losses and demonstrates substantial performance gains over state-of-the-art FGVC approaches across CNN and Transformer backbones, especially in low-shot regimes. The proposed benchmarks reveal challenging glyph-level similarity without semantic parts, while CCFG-Net provides robust recognition in the wild, with practical implications for scene text recognition and digital maps. The work opens avenues for richer resembling dictionaries and improved negative sampling strategies to further boost discriminability.

Abstract

Text recognition in the wild is an important technique for digital maps and urban scene understanding, and the natural resemblance between glyphs is one of the major causes of recognition errors. To address this challenge, we introduce two extremely fine-grained visual recognition benchmark datasets containing highly similar glyphs (characters/letters) in the wild that must be distinguished. Moreover, we propose a simple yet effective two-stage contrastive learning approach to the extremely fine-grained task of discriminating resembling glyphs. In the first stage, we use supervised contrastive learning, which leverages label information, to warm up the backbone network. In the second stage, we introduce CCFG-Net, a network architecture that integrates classification and contrastive learning in both Euclidean and angular spaces; contrastive learning is applied in both supervised and pairwise-discrimination manners to enhance the model's feature representation capability. Overall, our proposed approach effectively exploits the complementary strengths of contrastive learning and classification, leading to improved recognition performance on resembling glyphs. Comparative evaluations against state-of-the-art fine-grained classification approaches with both Convolutional Neural Network (CNN) and Transformer backbones demonstrate the superiority of our proposed method.
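The first-stage warm-up relies on supervised contrastive learning. As a rough illustration only, the following NumPy sketch computes a supervised contrastive loss over one batch in the commonly used formulation of Khosla et al.; it is an assumption that the paper's $L_{SCL}$ follows this exact form, and the function name, the `temperature` default, and the toy inputs are hypothetical.

```python
import numpy as np

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Batch supervised contrastive loss (illustrative sketch, not the paper's code).

    features: (n, d) array of non-zero embedding vectors.
    labels:   (n,) array of class labels.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = labels.shape[0]

    # L2-normalise so that dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self

    # Log-softmax over every *other* sample in the batch (self excluded).
    masked = np.where(not_self, sim, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)
    log_prob = masked - np.log(np.exp(masked).sum(axis=1, keepdims=True))

    # Average log-probability of the positives for each anchor that has any.
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / pos_count[has_pos]).mean())

# Toy batch: two tight clusters of embeddings.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = supervised_contrastive_loss(feats, [0, 0, 1, 1])     # labels match clusters
misaligned = supervised_contrastive_loss(feats, [0, 1, 0, 1])  # labels cut across clusters
```

In this toy batch the loss is small when the labels agree with the clusters and large when they cut across them, which is the behaviour a label-aware warm-up is meant to exploit.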

Paper Structure (19 sections, 7 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: An illustration of the differences between conventional fine-grained object recognition (e.g., bird species) and resembling glyphs discrimination. Objects in the former typically share common semantic parts, which are not available in the latter, making glyph discrimination a significantly more challenging, extremely fine-grained visual recognition task.
  • Figure 2: Examples of Chinese resembling characters. For each scene character image in the figure, we provide both the ground-truth character/class and the predicted class.
  • Figure 3: An overview of the proposed approach, showing the two training stages and the testing phase.
  • Figure 4: The expected effect of the losses on the deep features of three images. We consider three deep spaces: (i) the standard CE loss function, (ii) the first head of our CCFG-Net approach (${\mathcal{L}}_{Focal} + \lambda \, {\mathcal{L}}_{e}$), and (iii) the second head of our CCFG-Net approach (${\mathcal{L}}_{LMCL} + \lambda \, {\mathcal{L}}_{a}$). The label of Samples A and B is $class_1$ (饼) and the label of Sample C is $class_2$ (博). The deep features of Sample A in the CE, z and z' spaces are A, a and a', respectively. The deep features of Sample B in the CE, z and z' spaces are B, b and b', respectively. The deep features of Sample C in the CE, z and z' spaces are C, c and c', respectively. The objective of the first head is to minimize the distance $L_1$ and maximize the distances $L_2$ and $L_3$. The objective of the second head, on the other hand, is to minimize $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_6$ and maximize $\alpha_4$ and $\alpha_5$.
  • Figure 5: Study of the ratio of negative pairs to positive pairs for ResNet-50 and ViT Base backbones.
  • ...and 1 more figures
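
The two-head objective described in the Figure 4 caption (shrink distances or angles within a class, enlarge them across classes) can be illustrated with a small NumPy sketch. The exact definitions of ${\mathcal{L}}_{e}$ and ${\mathcal{L}}_{a}$ are not given in this summary, so the pairwise losses below are assumptions: a classic margin-based contrastive loss stands in for the Euclidean term and a cosine analogue stands in for the angular term. The large margin cosine loss follows the standard CosFace form. All function names, margins, the scale value, and the toy vectors are hypothetical, and the focal loss of the first head is omitted.

```python
import numpy as np

def euclidean_pair_loss(z1, z2, same_class, margin=1.0):
    """Assumed stand-in for L_e: pull same-class pairs together and push
    different-class pairs at least `margin` apart (Euclidean distance)."""
    d = float(np.linalg.norm(np.asarray(z1, float) - np.asarray(z2, float)))
    return d ** 2 if same_class else max(0.0, margin - d) ** 2

def angular_pair_loss(z1, z2, same_class, margin=0.0):
    """Assumed stand-in for L_a: shrink the angle between same-class pairs and
    penalise different-class pairs whose cosine similarity exceeds `margin`."""
    z1, z2 = np.asarray(z1, float), np.asarray(z2, float)
    cos = float(z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2)))
    return 1.0 - cos if same_class else max(0.0, cos - margin)

def lmcl_loss(feature, class_weights, label, scale=30.0, margin=0.35):
    """Large margin cosine loss (CosFace form) for a single sample."""
    f = np.asarray(feature, float)
    w = np.asarray(class_weights, float)
    f = f / np.linalg.norm(f)
    w = w / np.linalg.norm(w, axis=1, keepdims=True)
    logits = scale * (w @ f)          # scaled cosine to each class centre
    logits[label] -= scale * margin   # subtract the margin from the target cosine
    logits = logits - logits.max()    # numerical stability
    return float(-(logits[label] - np.log(np.exp(logits).sum())))

# Toy features mirroring Figure 4: a and b share a class, c belongs to another.
a, b, c = [1.0, 0.1], [0.9, 0.2], [0.1, 1.0]
centres = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical class centres

head_one = euclidean_pair_loss(a, b, True) + euclidean_pair_loss(a, c, False)
head_two = lmcl_loss(a, centres, 0) + angular_pair_loss(a, b, True) \
    + angular_pair_loss(a, c, False)
```

In the paper the pairwise terms are weighted by $\lambda$ before being added to the classification loss; the sketch sums them unweighted for brevity.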