Black-box Membership Inference Attacks against Fine-tuned Diffusion Models

Yan Pang, Tianhao Wang

TL;DR

This work tackles privacy leakage in fine-tuned diffusion models by proposing a reconstruction-based black-box membership inference framework that relies on similarity between user queries and model-generated outputs. Leveraging a theoretically grounded link between diffusion model training objectives and output memorization, the authors implement a four-scenario attack family with three inference models (threshold-based, distribution-based, classifier-based) and validate them with shadow models using auxiliary datasets. Empirical results on Stable Diffusion v1-5 fine-tuned with CelebA-Dialog, WIT, and MS COCO show ROC-AUC up to $0.95$ and strong performance even under limited query budgets, with robustness across encoders and prompts; DP-SGD defense significantly mitigates leakage. The work highlights tangible privacy risks in open-source downstream fine-tuning and suggests practical auditing and defense directions to curb data memorization in diffusion-based systems.

Abstract

With the rapid advancement of diffusion-based image-generative models, the quality of generated images has become increasingly photorealistic. Moreover, with the release of high-quality pre-trained image-generative models, a growing number of users are downloading these pre-trained models to fine-tune them with downstream datasets for various image-generation tasks. However, employing such powerful pre-trained models in downstream tasks presents significant privacy leakage risks. In this paper, we propose the first reconstruction-based membership inference attack framework, tailored for recent diffusion models, and in the more stringent black-box access setting. Considering four distinct attack scenarios and three types of attacks, this framework is capable of targeting any popular conditional generator model, achieving high precision, evidenced by an impressive AUC of $0.95$.

Paper Structure

This paper contains 71 sections, 2 theorems, 29 equations, 9 figures, 13 tables, 1 algorithm.

Key Result

Theorem 1

Assume a pre-trained diffusion model $\hat{x}_{\theta}$ with training set $\mathcal{D}_m$, where $\theta$ denotes the parameters of the model, and let a bit $b$ represent the membership of the query. (Earlier the paper uses $\mathcal{U}_{\theta}$ to denote the U-Net; here, with a slight abuse of notation, it writes $\hat{x}_{\theta}$ for ease of presentation.)

Figures (9)

  • Figure 1: Our attack takes a query sample $x$, consisting of an image $I_q$ and a text component $T_q$, and uses $T_q$ to query the model $m$ times, obtaining a generated image $I_g$ each time. We then compute the similarity score between $I_q$ and each $I_g$ with $S(\cdot,\cdot)$. The $m$ scores are aggregated using $f$ and used to train the attack model that determines membership.
  • Figure 2: Impact of different threshold values on attack results using the ROC-AUC metric across three datasets. Here, $\tau$ represents the best threshold selected for each attack from the shadow model based on the AUC scores.
  • Figure 3: AUC results on three datasets and four attack scenarios comparing five different image feature extractors.
  • Figure 4: Relationship between epoch progression and AUC score in Attack-I, Attack-II, Attack-III, and Attack-IV, indicating increasing memorization within image generation models over fine-tuning epochs.
  • Figure 5: Using $30$, $50$, $100$, or $200$ inference steps produced no noticeable differences in the overall structure of the generated images. Only subtle details, such as hair, varied.
  • ...and 4 more figures
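
The scoring step described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: images are represented by pre-computed feature vectors (standing in for the output of an image encoder), cosine similarity stands in for $S(\cdot,\cdot)$, the mean stands in for the aggregator $f$, and a fixed threshold `tau` stands in for the threshold-based attack model. All function and parameter names are hypothetical.

```python
import math
from statistics import mean
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Assumed stand-in for the similarity function S(., .)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def membership_score(
    query_features: Vector,
    generated_features: Sequence[Vector],
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
    aggregate: Callable[[Sequence[float]], float] = mean,
) -> float:
    """Score the query image I_q against the m generated images I_g.

    `generated_features` holds one feature vector per generated image;
    `aggregate` plays the role of f (the mean is an assumption).
    """
    if not generated_features:
        raise ValueError("at least one generated image is required")
    scores = [similarity(query_features, g) for g in generated_features]
    return aggregate(scores)


def threshold_attack(score: float, tau: float) -> bool:
    """Predict 'member' when the aggregated score reaches the threshold tau."""
    return score >= tau
```

In this sketch, `tau` would be chosen on a shadow model, as the Figure 2 caption describes; the distribution-based and classifier-based attack models would replace `threshold_attack` with a different decision rule over the same scores.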

Theorems & Definitions (6)

  • Theorem 1
  • proof : Proof Sketch
  • Theorem 2
  • proof : Proof Sketch
  • proof
  • proof