HiGFA: Hierarchical Guidance for Fine-grained Data Augmentation with Diffusion Models

Zhiguang Lu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang

TL;DR

HiGFA tackles FGVC data scarcity by integrating three guidance streams (text prompts for global diversity, transformed contour maps for structure, and a fine-grained classifier for category fidelity) within a diffusion model. By leveraging the diffusion process’s coarse-to-fine generation, HiGFA uses dynamic, confidence-aware scheduling that activates fine-grained guidance only when needed, preserving diversity while maintaining fidelity. Empirical results across six FGVC benchmarks, including few-shot settings and ViT backbones, show consistent improvements over traditional augmentations and prior diffusion-based methods. The work demonstrates that hierarchical, adaptive guidance can produce high-quality, diverse FGVC synthetic data, boosting downstream classifier performance.

Abstract

Generative diffusion models show promise for data augmentation. However, applying them to fine-grained tasks presents a significant challenge: ensuring synthetic images accurately capture the subtle, category-defining features critical for high fidelity. Standard approaches, such as text-based Classifier-Free Guidance (CFG), often lack the required specificity, potentially generating misleading examples that degrade fine-grained classifier performance. To address this, we propose Hierarchically Guided Fine-grained Augmentation (HiGFA). HiGFA leverages the temporal dynamics of the diffusion sampling process. It employs strong text and transformed contour guidance with fixed strengths in the early-to-mid sampling stages to establish overall scene, style, and structure. In the final sampling stages, HiGFA activates a specialized fine-grained classifier guidance and dynamically modulates the strength of all guidance signals based on prediction confidence. This hierarchical, confidence-driven orchestration enables HiGFA to generate diverse yet faithful synthetic images by intelligently balancing global structure formation with precise detail refinement. Experiments on several FGVC datasets demonstrate the effectiveness of HiGFA.
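
The two-stage schedule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `guidance_scales`, the switch point `switch_fraction`, the base scales, the `damping` factor, and the linear confidence rule are all assumptions introduced here. The abstract states only that text and contour guidance use fixed strengths early on, and that classifier guidance is activated late, with all strengths modulated by prediction confidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuidanceScales:
    text: float
    contour: float
    classifier: float


def guidance_scales(step, total_steps, confidence,
                    switch_fraction=0.7,   # assumed point where the late stage begins
                    base_text=7.5,         # assumed fixed text (CFG) scale
                    base_contour=1.0,      # assumed fixed contour scale
                    max_classifier=5.0,    # assumed upper bound for classifier guidance
                    damping=0.5):          # assumed reduction of text/contour when unsure
    """Return hypothetical per-step guidance scales for a staged schedule."""
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")

    progress = step / total_steps
    if progress < switch_fraction:
        # Early-to-mid stage: fixed text and contour guidance, no classifier guidance.
        return GuidanceScales(base_text, base_contour, 0.0)

    # Late stage: the less confident the fine-grained classifier is about the
    # target class, the more classifier guidance is applied and the more the
    # text and contour scales are reduced (a linear rule chosen for illustration).
    need = 1.0 - confidence
    return GuidanceScales(
        text=base_text * (1.0 - damping * need),
        contour=base_contour * (1.0 - damping * need),
        classifier=max_classifier * need,
    )
```

In a sampling loop, such a function would be called once per denoising step, with `confidence` taken from the fine-grained classifier's predicted probability for the target class on the current estimate of the image. Under this sketch, a sample that the classifier already recognizes receives almost no classifier guidance, while a harder sample keeps receiving it.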

Paper Structure

This paper contains 45 sections, 7 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Accurately depicting the red shoulder patches of the Red-winged Blackbird is essential for fine-grained classification. More examples are provided in the Appendix.
  • Figure 2: An overview of HiGFA: a) Existing diffusion-based augmentation methods are insufficient to ensure category fidelity. b) Incorporating contour guidance helps preserve scene layout and object structure, but it struggles with fine-grained category consistency and limits diversity. To enhance diversity, we apply transformations such as flipping, rotation, and thin-plate spline interpolation. c) To improve fine-grained category consistency, our adaptive strategy incorporates classifier guidance in the later stages and balances the scales of all three guidance signals on a sample-wise level. Without classifier guidance, the generation process tends to converge toward the mean of the class distribution. d) By combining hierarchical guidance with dynamic adjustment, our method generates augmented images with both high category fidelity and enhanced diversity.
  • Figure 3: Performance of the compared methods on the FGVC Aircraft, Stanford Cars, CUB, and Stanford Dogs datasets under few-shot settings. The results indicate that our method remains effective even when the classifier is not very reliable.
  • Figure 4: Qualitative comparison between images generated by HiGFA and by baseline generative augmentation methods for fine-grained categories. The comparison highlights HiGFA's ability to generate diverse images while better preserving subtle category-defining details. The complete results are provided in the Appendix.
  • Figure 5: Evolution of the guidance scale during generation. The red dashed line indicates the activation of classifier guidance and the dynamic mechanism. Easy samples with clean backgrounds need only brief classifier signals to guide generation, while hard samples with complex backgrounds and conflicting prompts require extended guidance for generation, highlighting the sample-wise adaptability of our dynamic guidance.
  • ...and 6 more figures