Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky
TL;DR
The paper addresses the challenge of using synthetic data for fine-grained visual classification (FGVC) with few real examples, where fine-tuning text-to-image (T2I) models can overfit and erode diversity. It proposes BOB, a two-stage approach that (1) preserves context during fine-tuning by enriching text prompts with class-agnostic background and pose descriptions, and (2) marginalizes context during generation by sampling it across the dataset to approximate the interventional distribution $P(X|do(Y))$. Extensive experiments across backbones, T2I models, and datasets show that BOB achieves state-of-the-art or near state-of-the-art performance in low-shot FGVC, with notable gains on Aircraft and competitive results on long-tail tasks. The work demonstrates that using caption-driven context at both the fine-tuning and generation stages improves alignment with real data distributions and enhances downstream classifier performance.
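As a reading aid, the interventional distribution mentioned above can be written with the standard backdoor adjustment. This is a sketch rather than the paper's own notation: the symbol $C$ for the class-agnostic context (background and pose), the contexts $c_i$, and the count $N$ are introduced here, and the formula assumes that $C$ is the only confounder between the class $Y$ and the image $X$.

$$P(X \mid do(Y=y)) = \sum_{c} P(X \mid Y=y, C=c)\, P(C=c) \approx \frac{1}{N} \sum_{i=1}^{N} P(X \mid Y=y, C=c_i)$$

Here $c_1, \dots, c_N$ stand for contexts extracted from the real images of all classes, so drawing contexts across the whole dataset gives a Monte Carlo estimate of the sum over $c$.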
Abstract
Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model's generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when the real training images are augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4 percentage points on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with 5 real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB-generated images achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with accuracy improvements of 2% or more in 14 of these settings.
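The condition-then-marginalize idea described above can be illustrated with a minimal Python sketch of the prompt construction only. Everything named here is hypothetical and not taken from the paper: the `Example` record, the prompt template in `build_prompt`, and the choice to sample background and pose jointly from one uniformly chosen real example. The attribute extraction itself (for instance with a captioning model) and the T2I fine-tuning are outside the scope of the sketch.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One real image, described by its label and extracted context (hypothetical record)."""
    class_name: str   # fine-grained class label
    background: str   # class-agnostic scene description
    pose: str         # class-agnostic object pose description


def build_prompt(class_name: str, background: str, pose: str) -> str:
    # Hypothetical template; the paper's actual prompt wording may differ.
    return f"a photo of a {class_name}, {pose}, {background}"


def finetune_captions(examples: list[Example]) -> list[str]:
    # Fine-tuning stage: each image is captioned with its OWN context,
    # so the model can attribute background and pose to the text, not the class.
    return [build_prompt(e.class_name, e.background, e.pose) for e in examples]


def generation_prompts(
    examples: list[Example], class_name: str, n: int, rng: random.Random
) -> list[str]:
    # Generation stage: the context is drawn from the whole dataset,
    # regardless of class, which marginalizes it out for the target class.
    prompts = []
    for _ in range(n):
        ctx = rng.choice(examples)
        prompts.append(build_prompt(class_name, ctx.background, ctx.pose))
    return prompts


if __name__ == "__main__":
    data = [
        Example("Boeing 737", "on a runway at dusk", "side view"),
        Example("Cessna 172", "in a clear blue sky", "banking left"),
    ]
    for caption in finetune_captions(data):
        print("fine-tune:", caption)
    for prompt in generation_prompts(data, "Boeing 737", 4, random.Random(0)):
        print("generate: ", prompt)
```

In this sketch a class such as "Boeing 737" can be paired at generation time with a context that was only observed for another class, which is the intended effect of sampling context across the dataset.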
