Learning Contrastive Self-Distillation for Ultra-Fine-Grained Visual Categorization Targeting Limited Samples

Ziye Fang, Xin Jiang, Hao Tang, Zechao Li

TL;DR

Ultra-fine-grained visual categorization suffers from extreme data scarcity per class and high inter/intra-class similarity. The authors propose CSDNet, which combines three modules (SSDP for instance-level adaptive augmentation, DDL with a dynamic memory queue for feature-level contrastive learning, and SSDT for logit-level self-distillation) to jointly improve discriminative representations under limited samples. The training objective is $\mathcal{L} = \mathcal{L}_{cls} + \alpha \mathcal{L}_{qc} + \beta \mathcal{L}_{ssdt}$; the method uses memory queues of length $\iota$, a margin $\xi$ in the contrastive loss, and adaptive masks derived from a content-aware kernel, while inference relies only on raw predictions to stay efficient. Empirical results on nine datasets show state-of-the-art performance on Ultra-FGVC and strong generalization on fine-grained tasks, supporting the practical value of contrastive self-distillation for learning robust representations with limited data.
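
As a rough illustration of the objective $\mathcal{L} = \mathcal{L}_{cls} + \alpha \mathcal{L}_{qc} + \beta \mathcal{L}_{ssdt}$, the sketch below combines three scalar loss terms with the weights $\alpha$ and $\beta$. The summary does not specify how the logit-level self-distillation term is computed; the KL divergence between softmax outputs of the raw and augmented views, the direction of that divergence, the helper names, and all numeric values here are assumptions made only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions (assumed distillation measure)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def total_loss(l_cls, l_qc, l_ssdt, alpha, beta):
    """Weighted sum L = L_cls + alpha * L_qc + beta * L_ssdt."""
    return l_cls + alpha * l_qc + beta * l_ssdt

# Hypothetical logits for one sample: raw view and augmented view.
raw_logits = [2.0, 0.5, -1.0]
aug_logits = [1.5, 0.7, -0.8]

# Assumption: the raw-view prediction acts as the reference distribution.
l_ssdt = kl_divergence(softmax(raw_logits), softmax(aug_logits))

# Hypothetical values for the other two terms and for the weights.
loss = total_loss(l_cls=1.2, l_qc=0.4, l_ssdt=l_ssdt, alpha=0.5, beta=1.0)
```

Keeping the weighting in a single function makes it easy to ablate a term by setting `alpha` or `beta` to zero.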

Abstract

In the field of intelligent multimedia analysis, ultra-fine-grained visual categorization (Ultra-FGVC) plays a vital role in distinguishing intricate subcategories within broader categories. However, this task is inherently challenging due to the complex granularity of category subdivisions and the limited availability of data for each category. To address these challenges, this work proposes CSDNet, a pioneering framework that effectively explores contrastive learning and self-distillation to learn discriminative representations specifically designed for Ultra-FGVC tasks. CSDNet comprises three main modules: Subcategory-Specific Discrepancy Parsing (SSDP), Dynamic Discrepancy Learning (DDL), and Subcategory-Specific Discrepancy Transfer (SSDT), which collectively enhance the generalization of deep models across the instance, feature, and logit prediction levels. To increase the diversity of training samples, the SSDP module introduces adaptive augmented samples to spotlight subcategory-specific discrepancies. Simultaneously, the proposed DDL module stores historical intermediate features in a dynamic memory queue, which optimizes the feature learning space through iterative contrastive learning. Furthermore, the SSDT module effectively distills subcategory-specific discrepancy knowledge from the inherent structure of limited training data using a self-distillation paradigm at the logit prediction level. Experimental results demonstrate that CSDNet outperforms current state-of-the-art Ultra-FGVC methods, emphasizing its efficacy and adaptability in addressing Ultra-FGVC tasks.
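
The dynamic memory queue described for the DDL module can be pictured with a small sketch: a fixed-length queue stores past features with their labels, and the current feature is compared against every stored entry. The abstract does not give the loss form, so the cosine-similarity formulation with a margin on different-class pairs, the class name `MemoryQueue`, and the toy features below are assumptions, not the authors' implementation.

```python
import math
from collections import deque

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

class MemoryQueue:
    """Fixed-length FIFO store of (feature, label) pairs (hypothetical helper)."""
    def __init__(self, length):
        # deque with maxlen drops the oldest entry automatically when full.
        self.items = deque(maxlen=length)

    def enqueue(self, feature, label):
        self.items.append((list(feature), label))

def queue_contrastive_loss(feature, label, queue, margin):
    """Assumed form: pull same-class pairs together, push different-class
    pairs apart until their similarity falls below `margin`."""
    if not queue.items:
        return 0.0
    total = 0.0
    for past_feature, past_label in queue.items:
        sim = cosine(feature, past_feature)
        if past_label == label:
            total += 1.0 - sim
        else:
            total += max(0.0, sim - margin)
    return total / len(queue.items)

queue = MemoryQueue(length=2)          # queue length is a toy value
queue.enqueue([1.0, 0.0], label=0)
queue.enqueue([0.0, 1.0], label=1)
queue.enqueue([1.0, 1.0], label=0)     # evicts the oldest entry

loss = queue_contrastive_loss([1.0, 0.0], label=0, queue=queue, margin=0.3)
```

Because the queue holds features from earlier iterations, each sample is contrasted against more examples than a single small batch provides, which is plausibly the appeal when only a few images exist per class.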

Paper Structure (31 sections, 13 equations, 10 figures, 12 tables)

This paper contains 31 sections, 13 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: An illustration of the differences among Ultra-FGVC, FGVC, and general categorization, and the data size available for each class in Ultra-FGVC. Img./Cls. denotes the number of samples in each category. As the granularity of data categorization becomes increasingly fine, the number of available samples per class in the dataset becomes very limited, with an average of only 3 to 11 images per class.
  • Figure 2: An overview of the proposed method. First, a feature $\mathbf X$ is extracted from the input image $\mathbf I$ by the backbone. Second, the Subcategory-Specific Discrepancy Parsing (SSDP) module uses a content-aware kernel to obtain a pattern map $\mathbf P$. From this map, a subcategory-specific discrepancy mask is created and used to generate the augmented image $\mathbf I'$, which is then fed back into the backbone for training. The pattern map $\mathbf P$, after sigmoid activation, is also applied to the feature $\mathbf X$ to enhance the representation of subcategory-specific discrepancies. Third, historical image features are stored in a memory queue and integrated with the current features through queue contrastive learning. Finally, self-distillation is used to distill subcategory-specific knowledge between the raw and augmented images.
  • Figure 3: The implementation details of generating augmented images with the SSDP module. The binarization of the image is implemented by Eq. (1).
  • Figure 4: Visual comparison of CLE-ViT (row 2) and the proposed CSDNet (row 3) with the original images (row 1) on the CUB, SoyLocal, and Cotton80 datasets.
  • Figure 5: t-SNE visualization of features learned on the Cotton80 dataset for four settings: (a) Baseline, (b) Baseline + SSDP, (c) Baseline + SSDP + DDL, and (d) Baseline + SSDP + DDL + SSDT (full model). Each color represents a unique subcategory (10 subcategories in total).
  • ...and 5 more figures
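
The SSDP step in the Figure 2 caption (pattern map, sigmoid gating of the feature, and a binary mask that yields the augmented image) might look roughly like the sketch below. The binarization rule is not given in this excerpt, so thresholding at the mean of the pattern map is an assumption, as is the choice to keep high-response positions rather than erase them; the sketch also assumes the pattern map, feature, and image share one spatial resolution, and all function names and values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_feature(feature, pattern):
    """Reweight each feature position by the sigmoid of the pattern map."""
    return [[f * sigmoid(p) for f, p in zip(f_row, p_row)]
            for f_row, p_row in zip(feature, pattern)]

def binary_mask(pattern):
    """Assumed binarization: 1 where the response exceeds the map's mean."""
    values = [p for row in pattern for p in row]
    threshold = sum(values) / len(values)
    return [[1 if p > threshold else 0 for p in row] for row in pattern]

def augment(image, mask):
    """Assumed augmentation: keep masked-in pixels, zero out the rest."""
    return [[pix * m for pix, m in zip(i_row, m_row)]
            for i_row, m_row in zip(image, mask)]

# Toy 2x2 inputs standing in for P, X, and I.
pattern = [[2.0, -1.0], [0.0, 3.0]]
feature = [[1.0, 1.0], [1.0, 1.0]]
image = [[10, 20], [30, 40]]

mask = binary_mask(pattern)        # mean of the pattern map is 1.0
augmented = augment(image, mask)
gated = gate_feature(feature, pattern)
```

In a real pipeline the pattern map would come from the content-aware kernel and would usually need to be upsampled to the image resolution before masking.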