Stable Diffusion Dataset Generation for Downstream Classification Tasks
Eugenio Lomurno, Matteo D'Oria, Matteo Matteucci
TL;DR
Addresses the challenge of creating high-information synthetic datasets for downstream classification using diffusion models. Proposes a class-conditioned adaptation of Stable Diffusion 2.0 via a Class-Encoder and a four-step pipeline: (1) Class-Encoder transfer learning, (2) hyper-parameter optimisation of the number of inference steps, $IS \in [5,50]$, and the unconditional guidance scale, $UGS \in [0,7.5]$, (3) diffusion fine-tuning, and (4) a final optimisation step used to generate datasets from 1x to 10x the size of the real one. Classifiers trained on the synthetic data reach a competitive Classification Accuracy Score (CAS) and, on some datasets, outperform those trained on real data, while per-sample generation time is reduced. Highlights the potential of synthetic data for data-scarce and privacy-sensitive applications and outlines future directions including data filtering, post-processing, and active learning.
Abstract
Recent advances in generative artificial intelligence have enabled the creation of high-quality synthetic data that closely mimics real-world data. This paper explores the adaptation of the Stable Diffusion 2.0 model for generating synthetic datasets, using Transfer Learning, Fine-Tuning and generation parameter optimisation techniques to improve the utility of the dataset for downstream classification tasks. We present a class-conditional version of the model that exploits a Class-Encoder and optimisation of key generation parameters. Our methodology led to synthetic datasets that, in a third of cases, produced models that outperformed those trained on real datasets.
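As a rough illustration of the generation-parameter optimisation step, the sketch below searches over the two parameters named in the TL;DR, the number of inference steps (IS, 5 to 50) and the unconditional guidance scale (UGS, 0 to 7.5). It is a minimal sketch under stated assumptions, not the authors' procedure: plain random search is swapped in because this section does not name a search strategy, and `toy_evaluate` is a hypothetical placeholder for the real objective (generate a synthetic dataset, train a classifier on it, and score it on real test data).

```python
import random

# Search ranges taken from the TL;DR; everything else here is illustrative.
IS_RANGE = (5, 50)        # number of inference steps (integer)
UGS_RANGE = (0.0, 7.5)    # unconditional guidance scale (float)


def sample_config(rng):
    """Draw one candidate generation configuration uniformly at random."""
    return {
        "inference_steps": rng.randint(*IS_RANGE),   # inclusive on both ends
        "guidance_scale": rng.uniform(*UGS_RANGE),
    }


def random_search(evaluate, n_trials=20, seed=0):
    """Return the best (config, score) pair found over n_trials random draws.

    `evaluate` maps a config dict to a score where higher is better.
    """
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_evaluate(cfg):
    """Hypothetical stand-in objective with an arbitrary optimum.

    In a real pipeline this would generate images with `cfg`, train a
    classifier on them, and return its accuracy on real test data.
    """
    return (-abs(cfg["inference_steps"] - 25) / 25
            - abs(cfg["guidance_scale"] - 3.0) / 3.0)


if __name__ == "__main__":
    cfg, score = random_search(toy_evaluate, n_trials=50, seed=1)
    print(cfg, score)
```

Because each evaluation of the real objective involves generating a dataset and training a classifier, the number of trials is the main cost driver; a model-based search could replace the random sampler without changing the interface of `random_search`.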
