Attributed Synthetic Data Generation for Zero-shot Domain-specific Image Classification

Shijian Wang, Linxin Song, Ryotaro Shimizu, Masayuki Goto, Hanqian Wu

TL;DR

This work tackles zero-shot domain-specific image classification by enriching synthetic training data with diverse attributed prompts. AttrSyn leverages large language models to generate class-dependent and class-independent attribute concepts and values, which are then combined with class names to produce varied prompts for a text-to-image model, such as Stable Diffusion XL. Training classifiers on these attributed synthetic images yields substantial gains over simple prompt baselines and CLIP zero-shot on two fine-grained datasets, with improvements up to 13.62 percentage points. Overall, the approach demonstrates that diversity-aware, attribute-driven synthetic data can meaningfully close the gap between synthetic and real images in zero-shot, domain-specific contexts.

Abstract

Zero-shot domain-specific image classification is challenging because real images must be classified without any ground-truth in-domain training examples. Recent work uses text knowledge together with a text-to-image model to generate in-domain training images in zero-shot scenarios. However, existing methods rely heavily on simple prompt strategies, which limits the diversity of the synthetic training images and leads to inferior performance compared with training on real images. In this paper, we propose AttrSyn, which leverages large language models to generate attributed prompts. These prompts allow for the generation of more diverse attributed synthetic images. Experiments on zero-shot domain-specific image classification with two fine-grained datasets show that training with synthetic images generated by AttrSyn significantly outperforms CLIP's zero-shot classification in most settings and consistently surpasses simple prompt strategies.

Paper Structure

This paper contains 13 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overall workflow of AttrSyn. In the attribute concept generation stage, high-quality attribute concepts for a given dataset are derived by querying a large language model and then filtering the results interactively with a human. In the attribute generation stage, the obtained attribute concepts are categorized into class-dependent and class-independent concepts, and a different query strategy is adopted for each category to generate diverse attribute candidate values. In the attributed image generation stage, attribute candidate values from the various attribute concepts are randomly selected and combined with class names to create diverse attributed prompts, which are then sent to a text-to-image model to produce the corresponding attributed images.
  • Figure 2: Visualization of synthetic images generated by the base prompt and AttrSyn for the black-footed albatross class. AttrSyn produces more diverse images compared to the base prompt in both the photo and painting domains.
  • Figure 3: Test performance at different scales of synthetic training data generated by AttrSyn.
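
The attributed image generation stage described in the Figure 1 caption (randomly select one candidate value per attribute concept, then combine the values with the class name) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the attribute concepts and values, the prompt template, and the function names `build_prompt` and `build_prompts` are all hypothetical. In the actual workflow the attribute tables would come from the large language model queries, and the resulting prompts would be passed to a text-to-image model.

```python
import random

# Hypothetical class-independent attribute concepts and candidate values.
# In the described workflow these would be produced by querying an LLM.
CLASS_INDEPENDENT = {
    "background": ["over open ocean", "on a rocky shore", "in a grassy field"],
    "lighting": ["at sunrise", "under overcast skies", "in bright daylight"],
}

# Hypothetical class-dependent attribute concepts, keyed by class name.
CLASS_DEPENDENT = {
    "black-footed albatross": {
        "pose": ["gliding", "resting on water", "standing"],
    },
}


def build_prompt(class_name, domain, rng):
    """Pick one random value per attribute concept and join it to the class name.

    The template "a {domain} of a {class_name}, ..." is an assumed format.
    """
    concepts = dict(CLASS_INDEPENDENT)
    # Classes without class-dependent attributes fall back to the shared ones.
    concepts.update(CLASS_DEPENDENT.get(class_name, {}))
    # Iterate in sorted concept order so output is reproducible for a given seed.
    chosen = [rng.choice(values) for _, values in sorted(concepts.items())]
    return f"a {domain} of a {class_name}, " + ", ".join(chosen)


def build_prompts(class_name, domain, n, seed=0):
    """Generate n attributed prompts for one class with a seeded generator."""
    rng = random.Random(seed)
    return [build_prompt(class_name, domain, rng) for _ in range(n)]


if __name__ == "__main__":
    for prompt in build_prompts("black-footed albatross", "photo", 3):
        print(prompt)
```

Each generated prompt could then be sent to a text-to-image model such as Stable Diffusion XL. Sampling attribute values independently per prompt is what introduces variation across the synthetic images for a single class, in contrast to a fixed base prompt.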