Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks

Eugenio Lomurno; Matteo Matteucci

TL;DR

The paper tackles the challenge of training high-performance classifiers from synthetic data while preserving privacy. It proposes Knowledge Recycling (KR), a pipeline built around Generative Knowledge Distillation (GKD): a Teacher Classifier supplies soft labels for synthetic data that is regenerated during training, and a Student Classifier learns from those labelled samples. Generation is further optimised through the Checkpoint Optimisation and Tuning steps. Across nine datasets, including six MedMNIST medical datasets, KR narrows the performance gap with real data (average Classification Accuracy Score (CAS) gap of $-1.24$ percentage points) and yields near-immunity to Membership Inference Attacks, in many cases improving privacy without sacrificing accuracy. These results suggest a practical route to privacy-preserving learning from synthetic data in clinical contexts and beyond, which could scale further with additional Generators and higher-resolution data.

Abstract

Generative artificial intelligence has transformed the generation of synthetic data, providing innovative solutions to challenges like data scarcity and privacy, which are particularly critical in fields such as medicine. However, the effective use of this synthetic data to train high-performance models remains a significant challenge. This paper addresses this issue by introducing Knowledge Recycling (KR), a pipeline designed to optimise the generation and use of synthetic data for training downstream classifiers. At the heart of this pipeline is Generative Knowledge Distillation (GKD), the proposed technique that significantly improves the quality and usefulness of the information provided to classifiers through a synthetic dataset regeneration and soft labelling mechanism. The KR pipeline has been tested on a variety of datasets, with a focus on six highly heterogeneous medical image datasets, ranging from retinal images to organ scans. The results show a significant reduction in the performance gap between models trained on real and synthetic data, with models based on synthetic data outperforming those trained on real data in some cases. Furthermore, the resulting models show almost complete immunity to Membership Inference Attacks, manifesting privacy properties missing in models trained with conventional techniques.
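The regeneration and soft-labelling mechanism described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the Generator is replaced by hypothetical class-conditional 2-D Gaussians, the Teacher by a fixed linear classifier, and the Student by a softmax regression; all function names and hyperparameters are assumptions made for illustration. The point is only the loop structure: fresh synthetic samples are drawn at every epoch, the Teacher assigns them soft labels, and the Student is fitted to those soft labels.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical stand-in for a trained conditional Generator:
# one 2-D Gaussian per class (assumption, not from the paper).
CLASS_MEANS = [(-2.0, 0.0), (2.0, 0.0)]


def generate(n, rng):
    """Sample n synthetic (features, class) pairs from the toy Generator."""
    samples = []
    for _ in range(n):
        y = rng.randrange(len(CLASS_MEANS))
        mx, my = CLASS_MEANS[y]
        samples.append(((rng.gauss(mx, 1.0), rng.gauss(my, 1.0)), y))
    return samples


def teacher_probs(x):
    """Hypothetical pre-trained Teacher: returns soft labels for x."""
    return softmax([-1.5 * x[0], 1.5 * x[0]])


def student_probs(weights, x):
    """Student: softmax regression with one (w0, w1, bias) row per class."""
    return softmax([w[0] * x[0] + w[1] * x[1] + w[2] for w in weights])


def train_student_gkd(epochs=30, batch=64, lr=0.1, seed=0):
    """Fit the Student to Teacher soft labels on regenerated synthetic data."""
    rng = random.Random(seed)
    weights = [[0.0, 0.0, 0.0] for _ in CLASS_MEANS]
    for _ in range(epochs):
        # Regenerate a fresh synthetic batch every epoch instead of reusing one.
        for x, _ in generate(batch, rng):
            target = teacher_probs(x)  # soft label, the Generator's class is ignored
            pred = student_probs(weights, x)
            for k, w in enumerate(weights):
                grad = pred[k] - target[k]  # gradient of soft cross-entropy w.r.t. logit k
                w[0] -= lr * grad * x[0]
                w[1] -= lr * grad * x[1]
                w[2] -= lr * grad
    return weights


def accuracy(weights, data):
    """Fraction of (features, class) pairs the Student classifies correctly."""
    correct = 0
    for x, y in data:
        p = student_probs(weights, x)
        correct += int(p.index(max(p)) == y)
    return correct / len(data)


if __name__ == "__main__":
    student = train_student_gkd()
    # Stand-in for a real validation set; here it is drawn from the same toy distribution.
    validation = generate(500, random.Random(123))
    print("toy CAS-style accuracy:", accuracy(student, validation))
```

In this toy setting the accuracy of the synthetically trained Student on the held-out set plays the role of the Classification Accuracy Score; in the paper's setting the held-out set would be real data that the Student never sees during training.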

Paper Structure (14 sections, 4 figures, 5 tables)

Introduction
Related Works
Privacy Threats and Countermeasures
Method
Teacher Classifier
Generator
Evaluation Metric
Checkpoint Optimisation
Tuning
Membership Inference Attack
Experiments and Results
Discussion and Limitations
Conclusions
Acknowledgements

Figures (4)

Figure 1: The difference between an Ordinary Training and the proposed Generative Knowledge Distillation technique, and the illustration of the Knowledge Recycling pipeline.
Figure 2: The Classification Accuracy Score (CAS) of the validation calculated for each checkpoint of the Generator for the considered datasets. The continuous blue line represents the CAS obtained during the Checkpoint Optimisation step using the Generative Knowledge Distillation technique. The best checkpoint is marked with a blue star. The dashed grey line represents the best validation Accuracy obtained with the Teacher Classifier. The red star indicates the optimal checkpoint CAS of the validation after training with Generative Knowledge Distillation with parameters found during the Tuning step.
Figure 3: The Classification Accuracy Score (CAS) of the validation calculated for each checkpoint of the BigGAN-Deep (vanilla) and BigGAN-Deep (ours) generators for the considered datasets.
Figure 4: The Classification Accuracy Score (CAS) of the validation calculated for each checkpoint of the BigGAN-Deep (ours) generator for the considered datasets. The comparison is made between an "ordinary" strategy with a single dataset generation at the beginning of each training (Baseline), the approach presented by Lampis et al. (Gap Filler), and the one proposed in this work (Generative Knowledge Distillation).
