Stable Diffusion Dataset Generation for Downstream Classification Tasks

Eugenio Lomurno; Matteo D'Oria; Matteo Matteucci

TL;DR

Addresses the challenge of creating high-information synthetic datasets for downstream classification using diffusion models. Proposes a class-conditioned adaptation of Stable Diffusion 2.0 built around a Class-Encoder, with a four-step pipeline: (1) Class-Encoder transfer learning, (2) hyper-parameter optimisation of the number of inference steps, $IS \in [5,50]$, and the unconditional guidance scale, $UGS \in [0,7.5]$, (3) diffusion fine-tuning, and (4) a final optimisation, used to generate datasets from 1x to 10x the size of the real one. Demonstrates that classifiers trained on the synthetic data reach a competitive Classification Accuracy Score (CAS) and, on some datasets, outperform those trained on real data, while reducing per-sample generation time. Highlights the potential of synthetic data for data-scarce and privacy-sensitive applications and outlines future directions including data filtering, post-processing, and active learning.
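As a rough illustration of the hyper-parameter optimisation step, the sketch below searches the two stated ranges for $IS$ and $UGS$. It is a minimal sketch under loud assumptions: the summary does not say which optimiser the authors use, so plain random search is swapped in here, and `placeholder_cas` is a hypothetical toy objective standing in for the real loop of generating data, training a classifier, and scoring it on real test data. All function and key names are illustrative.

```python
import random

# Search ranges taken from the TL;DR; everything else below is an assumption.
IS_RANGE = (5, 50)       # number of inference steps (integer)
UGS_RANGE = (0.0, 7.5)   # unconditional guidance scale (float)


def sample_config(rng):
    """Draw one (IS, UGS) pair uniformly from the stated ranges."""
    return {
        "inference_steps": rng.randint(*IS_RANGE),   # inclusive on both ends
        "guidance_scale": rng.uniform(*UGS_RANGE),
    }


def placeholder_cas(config):
    """Hypothetical stand-in for the real objective.

    A real evaluation would generate a synthetic dataset with this config,
    train a classifier on it, and measure accuracy on real test data.
    The peak location used here (30 steps, scale 3.0) is arbitrary and
    says nothing about the paper's results.
    """
    steps_term = -abs(config["inference_steps"] - 30) / 50.0
    scale_term = -abs(config["guidance_scale"] - 3.0) / 7.5
    return steps_term + scale_term


def random_search(evaluate, n_trials=20, seed=0):
    """Return the best config and score found over n_trials random draws."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(placeholder_cas)
print(best_config, best_score)
```

Replacing `placeholder_cas` with a function that actually generates images and trains a classifier would make each trial expensive, which is one reason a more sample-efficient optimiser than random search may be preferable in practice.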

Abstract

Recent advances in generative artificial intelligence have enabled the creation of high-quality synthetic data that closely mimics real-world data. This paper explores the adaptation of the Stable Diffusion 2.0 model for generating synthetic datasets, using Transfer Learning, Fine-Tuning and generation parameter optimisation techniques to improve the utility of the dataset for downstream classification tasks. We present a class-conditional version of the model that exploits a Class-Encoder and optimisation of key generation parameters. Our methodology led to synthetic datasets that, in a third of cases, produced models that outperformed those trained on real datasets.

Paper Structure (5 sections, 1 figure, 2 tables)

This paper contains 5 sections, 1 figure, 2 tables.

Introduction
Related Works
Method
Experiments and Results
Conclusions and Future Directions

Figures (1)

Figure 1: The average importance of the hyper-parameters for the two optimisation steps.
