Attributed Synthetic Data Generation for Zero-shot Domain-specific Image Classification
Shijian Wang, Linxin Song, Ryotaro Shimizu, Masayuki Goto, Hanqian Wu
TL;DR
This work tackles zero-shot domain-specific image classification by enriching synthetic training data with diverse attributed prompts. AttrSyn leverages large language models to generate class-dependent and class-independent attribute concepts and values, which are then combined with class names to produce varied prompts for a text-to-image model such as Stable Diffusion XL. Training classifiers on these attributed synthetic images yields substantial gains over simple prompt baselines and CLIP zero-shot classification on two fine-grained datasets, with improvements of up to 13.62 percentage points. Overall, the approach demonstrates that diversity-aware, attribute-driven synthetic data can meaningfully narrow the gap between synthetic and real images in zero-shot, domain-specific contexts.
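The prompt-composition step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_attributed_prompts`, the prompt template, the example class, and the attribute concepts and values are all hypothetical, and the attribute dictionaries are hard-coded here as stand-ins for what a large language model would return. The sketch only shows how one value per attribute concept might be sampled and merged with a class name to yield varied prompts.

```python
import random

# Hypothetical attribute sets standing in for LLM output (not from the paper).
# Class-independent attributes apply to every class in the domain.
CLASS_INDEPENDENT = {
    "background": ["grassy park", "sandy beach", "living room"],
    "lighting": ["soft morning light", "overcast sky", "golden hour"],
}
# Class-dependent attributes are specific to a single class.
CLASS_DEPENDENT = {
    "golden retriever": {
        "coat": ["light cream coat", "dark golden coat"],
        "pose": ["sitting", "running", "lying down"],
    },
}


def build_attributed_prompts(class_name, domain, class_dependent,
                             class_independent, n_prompts, seed=0):
    """Compose n_prompts text-to-image prompts for one class.

    Each prompt samples one value for every attribute concept, so repeated
    calls over many classes produce varied descriptions of the same label.
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    # Merge shared attributes with those specific to this class (if any).
    attributes = {**class_independent, **class_dependent.get(class_name, {})}
    prompts = []
    for _ in range(n_prompts):
        chosen = {concept: rng.choice(values)
                  for concept, values in sorted(attributes.items())}
        details = ", ".join(f"{concept}: {value}"
                            for concept, value in chosen.items())
        # Assumed template; the actual wording used by AttrSyn is not given here.
        prompts.append(f"a photo of a {class_name}, a type of {domain}, {details}")
    return prompts


for prompt in build_attributed_prompts("golden retriever", "dog",
                                       CLASS_DEPENDENT, CLASS_INDEPENDENT,
                                       n_prompts=3):
    print(prompt)
```

Each resulting string would then be passed to a text-to-image model to synthesize one training image labeled with `class_name`. A simple prompt baseline corresponds to calling the function with empty attribute dictionaries, which always produces the same prompt for a given class.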
Abstract
Zero-shot domain-specific image classification is challenging because real images must be classified without any ground-truth in-domain training examples. Recent work draws on textual knowledge, using a text-to-image model to generate in-domain training images in zero-shot scenarios. However, existing methods rely heavily on simple prompt strategies, which limits the diversity of the synthetic training images and leads to inferior performance compared to training on real images. In this paper, we propose AttrSyn, which leverages large language models to generate attributed prompts. These prompts allow for the generation of more diverse attributed synthetic images. Experiments on zero-shot domain-specific image classification with two fine-grained datasets show that training with synthetic images generated by AttrSyn significantly outperforms CLIP's zero-shot classification in most settings and consistently surpasses simple prompt strategies.
