Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification

William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky

TL;DR

The paper addresses the challenge of using synthetic data for fine-grained classification with few real examples, where fine-tuning T2I models can overfit and erode diversity. It proposes BOB, a two-stage approach that (1) preserves context during fine-tuning by enriching text prompts with class-agnostic background and pose, and (2) marginalizes context during generation by sampling across the dataset to approximate the interventional distribution $P(X|do(Y))$. Extensive experiments across backbones, T2I models, and datasets show that BOB achieves state-of-the-art or near state-of-the-art performance in low-shot FGVC, with notable gains on Aircraft and competitive results on long-tail tasks. The work demonstrates that integrating caption-driven context at both the fine-tuning and generation stages improves alignment with real data distributions and enhances downstream classifier performance, highlighting the practical impact of context-aware synthetic data for robust fine-grained recognition.
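
One way to read the marginalization step is through the standard backdoor adjustment. This is a sketch under an assumption made here for illustration: the context $C$ (background and pose; the symbol $C$ is not from the source) is taken to be the only confounder between the class $Y$ and the image $X$. The paper's own causal graph is given in Figure 2.

$$P(X \mid do(Y=y)) = \sum_{c} P(X \mid Y=y, C=c)\, P(C=c)$$

Under this reading, sampling contexts from the whole dataset rather than only from the images of class $y$ corresponds to drawing $c$ from $P(C)$ instead of $P(C \mid Y=y)$.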

Abstract

Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model's generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with five real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with 2+% accuracy improvements in 14 of these settings.
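
The two stages described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Example` record, the `caption` template, and the function names are all hypothetical, and the attribute strings stand in for whatever a captioning model would extract. It shows only how prompts might be built: captions keep each image's own context during fine-tuning, while generation prompts pair a class with contexts drawn uniformly from the entire dataset.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One real training image with caption-derived attributes (hypothetical schema)."""
    class_name: str
    background: str
    pose: str


def caption(class_name: str, background: str, pose: str) -> str:
    # Hypothetical prompt template; the source does not specify the exact wording.
    return f"a photo of a {class_name}, {pose}, {background}"


def finetune_captions(examples):
    """Context preservation: each image keeps its own background and pose."""
    return [caption(e.class_name, e.background, e.pose) for e in examples]


def generation_prompts(examples, class_name, n, seed=0):
    """Context marginalization: sample (background, pose) pairs from ALL classes."""
    rng = random.Random(seed)
    contexts = [(e.background, e.pose) for e in examples]
    return [caption(class_name, *rng.choice(contexts)) for _ in range(n)]


examples = [
    Example("737-400", "on a runway", "parked"),
    Example("A320", "against a clear sky", "flying"),
    Example("Cessna 172", "with mountains behind", "taking off"),
]

train_text = finetune_captions(examples)
synthetic_prompts = generation_prompts(examples, "737-400", n=5)
```

Because the context pool ignores class labels, a class seen only on the ground in the real data can still be prompted in flight or during take-off, which is the intuition behind reducing spurious class-context associations.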

Paper Structure

This paper contains 19 sections, 1 equation, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Overview of BOB. We extract background and pose attributes from training images using a captioning model (Step 1), apply context preservation by fine-tuning the T2I model with enriched captions containing class names and context attributes (Step 2), and then perform context marginalization by generating synthetic data from background-pose pairs sampled at random across the entire dataset (Steps 3-4). This preserves class-relevant features while reducing spurious class-context associations.
  • Figure 2: Causal graph of generative process.
  • Figure 3: Visualizations. Left: 737-400 images from real data and synthetic data generated by Diff-II, DataDream, and BOB (ours). Diff-II generates high-contrast aircraft against simple backgrounds. DataDream generates more realistic aircraft, but only on the ground. Our method, BOB, generates realistic aircraft in diverse settings, such as taking off, flying, or on the ground with a mountainous background, resulting in images that are visually similar to real images.
  • Figure 4: Density plot of FID of synthetic data against the real data for each class.
  • Figure 5: Classification accuracy of caption model vs. downstream classifier trained on synthetic data from BOB.
  • ...and 1 more figure