Your Image Generator Is Your New Private Dataset

Nicolo Resmini, Eugenio Lomurno, Cristian Sbrolli, Matteo Matteucci

TL;DR

The paper addresses data scarcity and privacy concerns in image classification by generating high-fidelity synthetic data using a text-conditioned diffusion pipeline. The proposed Text-Conditioned Knowledge Recycling (TCKR) integrates dynamic BLIP-2 captions, LoRA-based diffusion adaptation, and Generative Knowledge Distillation to produce informative training samples and soft labels for a Student classifier. Empirical results across ten benchmarks show that models trained solely on TCKR data can match or exceed real-data performance while substantially reducing Membership Inference Attack risk, with an average AUC$_{MIA}$ reduction of $5.49$ points and an average AOP increase of $9.58$ points; moderate synthetic-data scales often yield the best privacy-utility balance. The work demonstrates the viability of privacy-preserving synthetic data as a substitute for real imagery in classifier training and highlights scaling behavior and practical trade-offs for real-world deployment. Code and trained models are released in an open-source repository.

Abstract

Generative diffusion models have emerged as powerful tools to synthetically produce training data, offering potential solutions to data scarcity and reducing labelling costs for downstream supervised deep learning applications. However, effectively leveraging text-conditioned image generation for building classifier training sets requires addressing key issues: constructing informative textual prompts, adapting generative models to specific domains, and ensuring robust performance. This paper proposes the Text-Conditioned Knowledge Recycling (TCKR) pipeline to tackle these challenges. TCKR combines dynamic image captioning, parameter-efficient diffusion model fine-tuning, and Generative Knowledge Distillation techniques to create synthetic datasets tailored for image classification. The pipeline is rigorously evaluated on ten diverse image classification benchmarks. The results demonstrate that models trained solely on TCKR-generated data achieve classification accuracies on par with (and in several cases exceeding) models trained on real images. Furthermore, the evaluation reveals that these synthetic-data-trained models exhibit substantially enhanced privacy characteristics: their vulnerability to Membership Inference Attacks is significantly reduced, with the membership inference AUC lowered by 5.49 points on average compared to using real training data, demonstrating a substantial improvement in the performance-privacy trade-off. These findings indicate that high-fidelity synthetic data can effectively replace real data for training classifiers, yielding strong performance whilst simultaneously providing improved privacy protection as a valuable emergent property. The code and trained models are available in the accompanying open-source repository.
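
To make the "membership inference AUC" figure concrete, the following is a minimal sketch of how an attack AUC can be computed on the 0-100 scale used above. It is not the paper's attack: the function name `mia_auc` and the idea that the attacker scores each sample with a single number (for example, the classifier's confidence) are assumptions made purely for illustration. The AUC is computed as the fraction of (member, non-member) pairs that the attacker ranks correctly, so 50 corresponds to guessing at chance.

```python
def mia_auc(member_scores, non_member_scores):
    """Attack AUC in points (0-100), via pairwise ranking.

    Hypothetical helper: each score is the attacker's belief that a
    sample was in the training set (assumed here, not from the paper).
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0   # member correctly ranked above non-member
            elif m == n:
                wins += 0.5   # ties count as half
    return 100.0 * wins / (len(member_scores) * len(non_member_scores))


# Toy scores: three of the four member/non-member pairs are ranked correctly.
print(mia_auc([0.9, 0.6], [0.7, 0.5]))  # 75.0
```

Under this reading, a reduction of 5.49 points moves the attacker closer to the chance level of 50, which is the sense in which privacy improves.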

Paper Structure

This paper contains 16 sections, 2 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Graphical abstract of the Text-Conditioned Knowledge Recycling (TCKR) pipeline summarising the entire process employed to generate synthetic datasets. Initially, for each image, a specific caption is produced using the BLIP‑2 model. Thereafter, the generator, based on Stable Diffusion 2.0, is adapted to the target domain using the LoRA technique and is conditioned via prompts that combine the class name with the caption (formatted as “n: c”). Finally, the Generative Knowledge Distillation transfers the knowledge from the teacher classifier to the student classifier, thereby enabling the development of models that achieve high classification performance whilst ensuring enhanced privacy protection.
  • Figure 2: Classification Accuracy Score (CAS) of the Student Classifier for different synthetic dataset cardinalities (left panel). The star marker indicates the Student’s peak CAS on each dataset. The right panel shows the average CAS improvement observed when increasing the synthetic dataset size from one cardinality to the next, averaged across all datasets. For reference, each horizontal dashed line denotes the performance of the corresponding Teacher Classifier.
  • Figure 3: Area Under the ROC Curve of the MIA (AUC$_{MIA}$) for Student Classifiers trained on synthetic datasets of various cardinalities (left panel). The star marker indicates the Student’s lowest AUC$_{MIA}$ (best privacy) achieved. The right panel shows the average AUC$_{MIA}$ increase between successive cardinality levels, averaged across all datasets. For each dataset, the horizontal dashed line represents the AUC$_{MIA}$ of the Teacher Classifier.
  • Figure 4: Accuracy Over Privacy (AOP) scores for Student Classifiers at different synthetic dataset cardinalities (left panel), with star markers indicating each Student’s highest AOP. The right panel shows the average AOP change between successive cardinality increases across all datasets. For each dataset, the horizontal dashed line represents the AOP of the corresponding Teacher Classifier.
  • Figure 5: Radar charts comparing CAS, AUC$_{MIA}$, and AOP for all datasets at each synthetic dataset size. Each chart plots the raw values of the three metrics (higher CAS and AOP, and lower AUC$_{MIA}$, are better). Note that lower AUC$_{MIA}$ values (closer to 50) indicate stronger privacy protection.
  • ...and 5 more figures
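
The prompt format described in the Figure 1 caption (class name and caption joined as "n: c") can be sketched as follows. This is only an illustration of the string format: the helper name `build_prompt` and the example class names and captions are hypothetical, and the captioning (BLIP-2) and generation (Stable Diffusion 2.0 with LoRA) steps are deliberately left out.

```python
def build_prompt(class_name: str, caption: str) -> str:
    """Combine class name n and caption c into the "n: c" prompt format.

    Hypothetical helper; whitespace stripping is an added convenience,
    not something stated in the source.
    """
    return f"{class_name.strip()}: {caption.strip()}"


# Made-up (class name, caption) pairs standing in for captioner output.
examples = [
    ("golden retriever", "a dog running on a beach"),
    ("airplane", "a white plane flying in a blue sky"),
]
for name, caption in examples:
    print(build_prompt(name, caption))
```

Prefixing the class name keeps the label explicit in every prompt, while the per-image caption supplies the variation in content.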