Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions
Tianxu Wu, Shuo Ye, Shuhuang Chen, Qinmu Peng, Xinge You
TL;DR
This work tackles fine-grained visual categorization under few-shot constraints with DRDM, a diffusion-based augmentation framework built on two key modules: Discriminative Semantic Recombination (DSR) and Spatial Knowledge Reference (SKR). DSR constrains diffusion-generated data using label-text semantic relationships via adapters and CLIP alignment to preserve discriminative details, while SKR anchors high-dimensional feature distributions with external datasets to expand decision boundaries. Together, they reduce feature contamination and improve class separability, achieving consistent gains over state-of-the-art methods across CUB, Dogs, and Cars in 1-shot and 5-shot settings. The approach demonstrates strong transferability to other FGSL models and highlights the value of leveraging cross-dataset knowledge for robust few-shot fine-grained recognition in practical applications.
Abstract
The challenge in fine-grained visual categorization (FGVC) lies in exploring the subtle differences between subclasses and discriminating among them accurately. Previous research has relied on large-scale annotated data and pre-trained deep models to achieve this objective. However, when only a limited number of samples is available, such methods may become less effective. Diffusion models have been widely adopted in data augmentation due to their outstanding diversity in data generation. However, the high level of detail required for fine-grained images makes it challenging to employ existing methods directly. To address this issue, we propose a novel approach termed the detail reinforcement diffusion model (DRDM), which leverages the rich knowledge of large models for fine-grained data augmentation and comprises two key components: discriminative semantic recombination (DSR) and spatial knowledge reference (SKR). Specifically, DSR is designed to extract implicit similarity relationships from the labels and reconstruct the semantic mapping between labels and instances, which enables better discrimination of subtle differences between subclasses. Furthermore, we introduce the SKR module, which incorporates the distributions of different datasets as references in the feature space. This allows SKR to aggregate the high-dimensional distribution of subclass features in few-shot FGVC tasks, thus expanding the decision boundary. Through these two critical components, we effectively utilize the knowledge from large models to address the issue of data scarcity, resulting in improved performance for fine-grained visual recognition tasks. Extensive experiments demonstrate the consistent performance gain offered by our DRDM.
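
As a rough illustration of what "implicit similarity relationships from the labels" could look like, the sketch below computes pairwise cosine similarity between label embeddings. It is not the DSR module itself: the function names, the label names, and the three-dimensional vectors are all hypothetical stand-ins, and in practice the embeddings would presumably come from a text encoder such as CLIP's rather than being written by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def label_similarity(embeddings):
    """Return a dict mapping every (label_a, label_b) pair to its cosine similarity."""
    names = list(embeddings)
    return {(a, b): cosine(embeddings[a], embeddings[b]) for a in names for b in names}


# Hypothetical toy embeddings; real ones would come from a text encoder (assumption).
toy_embeddings = {
    "Laysan Albatross": [0.9, 0.1, 0.0],
    "Sooty Albatross": [0.8, 0.2, 0.1],
    "Indigo Bunting": [0.1, 0.9, 0.3],
}

sim = label_similarity(toy_embeddings)

# Closely related subclasses should score higher than unrelated ones.
for pair in [("Laysan Albatross", "Sooty Albatross"), ("Laysan Albatross", "Indigo Bunting")]:
    print(pair, round(sim[pair], 3))
```

With these toy vectors the two albatross labels are far more similar to each other than either is to the bunting, which is the kind of label-level structure a generation constraint could exploit to keep near-neighbor subclasses distinguishable.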
