Black-box Membership Inference Attacks against Fine-tuned Diffusion Models
Yan Pang, Tianhao Wang
TL;DR
This work tackles privacy leakage in fine-tuned diffusion models by proposing a reconstruction-based black-box membership inference framework that relies on the similarity between user queries and the outputs the model generates for them. Building on a theoretically grounded link between the diffusion training objective and output memorization, the authors define four attack scenarios and three inference models (threshold-based, distribution-based, and classifier-based), and validate them with shadow models trained on auxiliary datasets. Experiments on Stable Diffusion v1-5 fine-tuned with CelebA-Dialog, WIT, and MS COCO show ROC-AUC up to $0.95$ and strong performance even under limited query budgets, with robustness across encoders and prompts; a DP-SGD defense significantly mitigates the leakage. The work highlights tangible privacy risks in open-source downstream fine-tuning and suggests practical auditing and defense directions to curb data memorization in diffusion-based systems.
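The threshold-based variant described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the query image and the model's generated images have already been mapped to embedding vectors by some image encoder, uses mean cosine similarity as the score, and takes the threshold `tau` as given (in practice it would be calibrated, for example on a shadow model). The names `membership_score` and `threshold_attack` are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def membership_score(query_embedding, generated_embeddings):
    """Mean similarity between the query and each generated output.

    Assumption: a higher score means the model reconstructs the query
    more faithfully, which is taken as evidence of membership.
    """
    sims = [cosine_similarity(query_embedding, g) for g in generated_embeddings]
    return float(np.mean(sims))


def threshold_attack(score, tau):
    """Predict 'member' when the score reaches the threshold tau."""
    return score >= tau


# Toy example with made-up 2-D embeddings (illustrative only).
query = [1.0, 0.0]
close_generations = [[1.0, 0.0], [2.0, 0.0]]   # same direction as the query
far_generations = [[0.0, 1.0], [0.0, 3.0]]     # orthogonal to the query

print(threshold_attack(membership_score(query, close_generations), tau=0.5))
print(threshold_attack(membership_score(query, far_generations), tau=0.5))
```

Averaging over several generations is one simple way to use a query budget larger than one; other aggregations (maximum, median) would fit the same skeleton.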
Abstract
With the rapid advancement of diffusion-based image-generative models, generated images have become increasingly photorealistic. Moreover, with the release of high-quality pre-trained image-generative models, a growing number of users are downloading these models and fine-tuning them on downstream datasets for various image-generation tasks. However, employing such powerful pre-trained models in downstream tasks presents significant privacy leakage risks. In this paper, we propose the first reconstruction-based membership inference attack framework tailored for recent diffusion models and designed for the more stringent black-box access setting. Considering four distinct attack scenarios and three types of attacks, this framework can target any popular conditional generator model and achieves strong attack performance, with an AUC of $0.95$.
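For readers unfamiliar with the reported metric, the following sketch shows one standard way to compute ROC-AUC from membership scores: it is the probability that a randomly chosen member receives a higher score than a randomly chosen non-member, with ties counted as one half. The function name `roc_auc` and the toy scores are illustrative assumptions, not values from the paper.

```python
import numpy as np


def roc_auc(member_scores, nonmember_scores):
    """ROC-AUC via pairwise comparison of member and non-member scores."""
    m = np.asarray(member_scores, dtype=float)
    n = np.asarray(nonmember_scores, dtype=float)
    # Compare every member score against every non-member score.
    greater = (m[:, None] > n[None, :]).sum()
    ties = (m[:, None] == n[None, :]).sum()
    return float((greater + 0.5 * ties) / (m.size * n.size))


# Toy scores (made up): three of the four member/non-member pairs
# are ranked correctly, so the AUC is 3/4.
print(roc_auc([0.9, 0.3], [0.5, 0.1]))  # 0.75
```

An AUC of $0.5$ corresponds to random guessing and $1.0$ to perfect separation of members from non-members.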
