Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation

Eyal Michaeli, Ohad Fried

TL;DR

This paper tackles the data scarcity and fine-grained fidelity challenges in FGVC by proposing SaSPA, a diffusion-based augmentation pipeline that does not rely on real images for guidance. SaSPA conditions generation on abstract structure via edge maps and on explicit subject representations, using GPT-4 to produce diverse meta-class prompts and applying CLIP-based semantic filtering together with predictive-confidence filtering to ensure quality. It demonstrates consistent improvements over traditional and concurrent generative baselines across full-dataset, few-shot, and bias-mitigated FGVC tasks, and provides insights into how to balance synthetic and real data (e.g., increasing synthetic data when real data is scarce). The method relies on a combination of ControlNet with BLIP-diffusion, edge-based conditioning, and careful prompting and filtering, establishing a practical, scalable approach for FGVC data augmentation with implications for broader recognition tasks; notable variables include $M=2$ augmentations per image and an augmentation ratio $\alpha$ typically in $[0.2,0.5]$ ($\alpha=0.4$ default).
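
As a rough illustration of how the two variables named above could interact at training time, the sketch below assumes that $\alpha$ is the per-sample probability of swapping a real image for one of its $M$ synthetic augmentations. That reading of $\alpha$, and the names `pick_training_sample` and `synthetic_paths`, are assumptions made for illustration and are not taken from the paper's code.

```python
import random


def pick_training_sample(real_path, synthetic_paths, alpha=0.4, rng=random):
    """Return the image path to load for one training step.

    Assumption (not confirmed by the source): with probability `alpha` the
    real image is replaced by one of its synthetic augmentations; otherwise
    the real image is used unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # If every augmentation of this image was filtered out, keep the real one.
    if synthetic_paths and rng.random() < alpha:
        return rng.choice(synthetic_paths)
    return real_path


if __name__ == "__main__":
    rng = random.Random(0)
    synthetic = ["aug_0.png", "aug_1.png"]  # M = 2 augmentations per image
    picks = [pick_training_sample("real.png", synthetic, 0.4, rng)
             for _ in range(10_000)]
    share = sum(p != "real.png" for p in picks) / len(picks)
    print(f"share of synthetic samples: {share:.2f}")
```

Under this assumption, raising `alpha` shifts the training mix toward synthetic images, which is the knob the summary suggests turning up when real data is scarce.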

Abstract

Fine-grained visual classification (FGVC) involves classifying closely related sub-classes. This task is difficult due to the subtle differences between classes and the high intra-class variance. Moreover, FGVC datasets are typically small and challenging to gather, thus highlighting a significant need for effective data augmentation. Recent advancements in text-to-image diffusion models offer new possibilities for augmenting classification datasets. While these models have been used to generate training data for classification tasks, their effectiveness in full-dataset training of FGVC models remains under-explored. Recent techniques that rely on Text2Image generation or Img2Img methods often struggle to generate images that accurately represent the class while modifying them enough to significantly increase the dataset's diversity. To address these challenges, we present SaSPA: Structure and Subject Preserving Augmentation. Contrary to recent methods, our method does not use real images as guidance, thereby increasing generation flexibility and promoting greater diversity. To ensure accurate class representation, we employ conditioning mechanisms, specifically by conditioning on image edges and subject representation. We conduct extensive experiments and benchmark SaSPA against both traditional and recent generative data augmentation methods. SaSPA consistently outperforms all established baselines across multiple settings, including full dataset training, contextual bias, and few-shot classification. Additionally, our results reveal interesting patterns in using synthetic data for FGVC models; for instance, we find a relationship between the amount of real data used and the optimal proportion of synthetic data. Code is available at https://github.com/EyalMichaeli/SaSPA-Aug.

Paper Structure (47 sections, 8 figures, 21 tables)

Figures (8)

  • Figure 1: Various generative augmentation methods applied on Aircraft (Maji et al., 2013). Text-to-image often compromises class fidelity, visible in the unrealistic aircraft design (i.e., tail at both ends). Img2Img trades off fidelity and diversity: lower strength (e.g., 0.5) introduces minimal semantic changes, resulting in higher fidelity but limited diversity, whereas higher strength (e.g., 0.75) introduces diversity but also inaccuracies such as the incorrectly added engine. In contrast, SaSPA achieves high fidelity and diversity, critical for Fine-Grained Visual Classification tasks. D: Diversity. F: Fidelity.
  • Figure 2: SaSPA Pipeline: For a given FGVC dataset, we generate prompts via GPT-4 based on the meta-class. Each real image undergoes edge detection to provide structural outlines. These edges are used $M$ times, each time with a different prompt and a different subject reference image from the same sub-class, as inputs to a ControlNet with BLIP-Diffusion as the base model. The generated images are then filtered using a dataset-trained model and CLIP to ensure relevance and quality.
  • Figure 3: Example augmentations using our method (SaSPA). The {} placeholder represents the specific sub-class.
  • Figure 4: Few-shot test accuracy across three FGVC datasets (Aircraft, Cars, and DTD) using different augmentation methods. The shot counts tested are 4, 8, 12, and 16. For all datasets and shot counts, SaSPA outperforms all other augmentation methods.
  • Figure 5: Line plots of Augmentation Ratio ($\alpha$) vs. validation accuracy for Aircraft, Cars, DTD, and CUB datasets.
  • ...and 3 more figures
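
The pipeline described in the Figure 2 caption can be sketched as a planning step followed by a filtering step. This is a minimal sketch and not the authors' implementation: it only decides which edge source, subject reference, and prompt go into each of the $M$ generations, and leaves edge detection, ControlNet/BLIP-Diffusion generation, and the actual CLIP and classifier scoring out. The names `GenerationJob`, `plan_jobs`, and `keep_generated`, as well as the threshold values, are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationJob:
    edge_source: str        # real image whose edge map supplies the structure
    subject_reference: str  # real image of the same sub-class (subject cue)
    prompt: str             # meta-class prompt with the sub-class filled in


def _draw(rng, pool, k):
    """Draw k items, without repetition when the pool is large enough."""
    if len(pool) >= k:
        return rng.sample(pool, k)
    return [rng.choice(pool) for _ in range(k)]


def plan_jobs(images_by_subclass, prompts, m=2, seed=0):
    """Plan m generations per real image, each with its own prompt/reference."""
    rng = random.Random(seed)
    jobs = []
    for subclass, images in images_by_subclass.items():
        for image in images:
            # Prefer references other than the edge source when any exist.
            candidates = [p for p in images if p != image] or [image]
            chosen_prompts = _draw(rng, prompts, m)
            chosen_refs = _draw(rng, candidates, m)
            for prompt, reference in zip(chosen_prompts, chosen_refs):
                jobs.append(GenerationJob(
                    edge_source=image,
                    subject_reference=reference,
                    prompt=prompt.format(subclass),  # fills the {} placeholder
                ))
    return jobs


def keep_generated(clip_score, label_confidence,
                   min_clip=0.2, min_confidence=0.1):
    """Hypothetical filter: both scores must clear illustrative thresholds."""
    return clip_score >= min_clip and label_confidence >= min_confidence


if __name__ == "__main__":
    images = {"Boeing 737": ["a.jpg", "b.jpg", "c.jpg"]}
    prompts = ["a photo of a {} on a runway", "a {} flying at sunset"]
    for job in plan_jobs(images, prompts, m=2):
        print(job)
```

Separating planning from generation keeps the sketch runnable without any diffusion model; in a real setup each `GenerationJob` would be handed to the generator, and the scores passed to `keep_generated` would come from CLIP and from a classifier trained on the dataset.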