Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions

Tianxu Wu, Shuo Ye, Shuhuang Chen, Qinmu Peng, Xinge You

TL;DR

This work tackles fine-grained visual categorization under few-shot constraints by integrating a diffusion-based augmentation framework, DRDM, with two key modules: Discriminative Semantic Recombination (DSR) and Spatial Knowledge Reference (SKR). DSR constrains diffusion-generated data using label-text semantic relationships via adapters and CLIP alignment to preserve discriminative details, while SKR anchors high-dimensional feature distributions with external datasets to expand decision boundaries. Together, they reduce feature contamination and improve class separability, achieving consistent gains over state-of-the-art methods across CUB, Dogs, and Cars in 1-shot and 5-shot settings. The approach demonstrates strong transferability to other FGSL models and highlights the value of leveraging cross-dataset knowledge for robust few-shot fine-grained recognition in practical applications.

Abstract

The challenge in fine-grained visual categorization (FGVC) lies in exploring the subtle differences between subclasses and discriminating between them accurately. Previous research has relied on large-scale annotated data and pre-trained deep models to achieve this objective. However, when only a limited number of samples is available, such methods may become less effective. Diffusion models have been widely adopted for data augmentation because of the diversity of the data they generate. However, the high level of detail required for fine-grained images makes it difficult to apply existing methods directly. To address this issue, we propose a novel approach termed the detail reinforcement diffusion model (DRDM), which leverages the rich knowledge of large models for fine-grained data augmentation and comprises two key components: discriminative semantic recombination (DSR) and spatial knowledge reference (SKR). Specifically, DSR is designed to extract implicit similarity relationships from the labels and reconstruct the semantic mapping between labels and instances, which enables better discrimination of subtle differences between subclasses. Furthermore, we introduce the SKR module, which incorporates the distributions of different datasets as references in the feature space. This allows SKR to aggregate the high-dimensional distribution of subclass features in few-shot FGVC tasks, thus expanding the decision boundary. Through these two components, we effectively utilize the knowledge from large models to address data scarcity, resulting in improved performance on fine-grained visual recognition tasks. Extensive experiments demonstrate the consistent performance gain offered by our DRDM.

Paper Structure (15 sections, 19 equations, 7 figures, 8 tables)

This paper contains 15 sections, 19 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Feature contamination resulting from semantic misalignment during data augmentation using large models. It appears as (a) irrelevant augmented data and (b) the loss of discriminative details.
  • Figure 2: Overview of the proposed method. Our framework first uses DSR to constrain similarity relations from the labels, thereby enhancing the model's understanding of instance-specific features. Then, during the classification process, we introduce instance features from different datasets for comparative reference, ensuring that the learned features possess a stronger representational capacity and robustness.
  • Figure 3: Analysis of dataset feature distributions. All features are extracted using a pre-trained ResNet-50. (a) Qualitative analysis, where points of different colors represent different datasets. (b) Quantitative analysis, which calculates the distances between the centers of each dataset.
  • Figure 4: The impact of different $N$ and $\beta$ on the learning process. The horizontal and vertical axes represent the number of introduced subclasses and the choice of $\beta$, respectively. The first and second rows show the results for the 1-shot and 5-shot scenarios, respectively, and each column shows the results for a different dataset. Color intensity encodes accuracy, with darker shades indicating higher accuracy.
  • Figure 5: Comparison of fine-grained image data augmented by diffusion-based models. The first row shows the source images, the second and third rows show results generated by some existing methods, and the last row shows our results.
  • ...and 2 more figures
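
The quantitative analysis described for Figure 3(b), distances between the feature centers of each dataset, can be illustrated with a minimal sketch. This is not the authors' code: the Euclidean metric, the function names `centroid` and `center_distances`, and the toy 2-D vectors (standing in for ResNet-50 features) are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def centroid(features: Sequence[Vector]) -> List[float]:
    """Return the mean feature vector (the 'center') of one dataset."""
    if not features:
        raise ValueError("need at least one feature vector")
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def center_distances(
    datasets: Dict[str, Sequence[Vector]],
) -> Dict[Tuple[str, str], float]:
    """Pairwise Euclidean distance between dataset centers.

    Euclidean distance is an assumption; the source only says that
    distances between dataset centers are calculated.
    """
    centers = {name: centroid(feats) for name, feats in datasets.items()}
    names = sorted(centers)
    result = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            result[(a, b)] = math.dist(centers[a], centers[b])
    return result


# Made-up 2-D features; real ResNet-50 features would be 2048-dimensional.
toy = {
    "CUB": [[0.0, 0.0], [2.0, 0.0]],   # center (1, 0)
    "Dogs": [[1.0, 3.0], [1.0, 5.0]],  # center (1, 4)
    "Cars": [[4.0, 4.0], [4.0, 4.0]],  # center (4, 4)
}
distances = center_distances(toy)
for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

In practice the toy vectors would be replaced by features extracted from each dataset's images; larger center distances suggest the datasets occupy more clearly separated regions of the feature space.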