DP-LDMs: Differentially Private Latent Diffusion Models

Michael F. Liu, Saiyue Lyu, Margarita Vinaroz, Mijung Park

TL;DR

The paper tackles the privacy risks of diffusion models by introducing DP-LDMs, which pretrain autoencoders and latent diffusion models on public data and privately fine-tune only the attention modules (and conditioning embedder) with DP-SGD on private data. This two-stage approach dramatically reduces trainable parameters while delivering high-resolution, conditioned image generation with differential privacy guarantees. Across multiple benchmarks, DP-LDMs achieve competitive or superior DP-utility trade-offs (as measured by FID and downstream task performance) and require substantially fewer computational resources than full-model DP fine-tuning. The work demonstrates that attention-level adaptation in latent diffusion spaces can effectively bridge domain shifts under DP constraints, offering a practical path for private generative modeling at scale.

Abstract

Diffusion models (DMs) are among the most widely used generative models for producing high-quality images. However, a flurry of recent papers shows that DMs are the least private form of image generator: a significant number of near-identical replicas of training images can be extracted from them. Existing privacy-enhancing techniques for DMs, unfortunately, do not provide a good privacy-utility trade-off. In this paper, we aim to improve the current state of DMs with differential privacy (DP) by adopting $\textit{Latent}$ Diffusion Models (LDMs). LDMs are equipped with powerful pre-trained autoencoders that map high-dimensional pixels into lower-dimensional latent representations, in which the DMs are trained, yielding more efficient and faster training. Rather than fine-tuning the entire LDM, we fine-tune only the $\textit{attention}$ modules of LDMs with DP-SGD, reducing the number of trainable parameters by roughly $90\%$ and achieving a better privacy-accuracy trade-off. Our approach allows us to generate realistic, high-dimensional images ($256 \times 256$) conditioned on text prompts with DP guarantees, which, to the best of our knowledge, has not been attempted before. Our approach provides a promising direction for training more powerful, yet training-efficient, differentially private DMs that produce high-quality DP images. Our code is available at https://anonymous.4open.science/r/DP-LDM-4525.
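
The private step described in the abstract can be illustrated with a minimal sketch, assuming a toy model whose parameters are flat lists of floats. It shows two ideas only: freezing everything except attention and conditioning-embedder parameters, and a generic DP-SGD update (per-example clipping, Gaussian noise, averaging). The parameter names, the keyword filter, and the helper functions are hypothetical and are not taken from the authors' code; privacy accounting for $\epsilon$ is omitted.

```python
import math
import random

def select_trainable(param_names, keywords=("attn", "cond_stage")):
    """Return the names to fine-tune; all other parameters stay frozen.
    The keywords are an assumed naming convention, not the paper's."""
    return [n for n in param_names if any(k in n for k in keywords)]

def clip(grad, max_norm):
    """Rescale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]

def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One generic DP-SGD update on the trainable parameters only."""
    batch_size = len(per_example_grads)
    # 1. Clip each example's gradient to bound its influence.
    clipped = [clip(g, max_norm) for g in per_example_grads]
    # 2. Sum the clipped gradients coordinate-wise.
    summed = [sum(column) for column in zip(*clipped)]
    # 3. Add Gaussian noise with standard deviation noise_multiplier * max_norm.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    # 4. Average over the batch and take a gradient step.
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]

# Hypothetical parameter names for a latent diffusion UNet.
names = [
    "unet.down.0.resblock.weight",
    "unet.down.0.attn.to_q.weight",
    "unet.mid.attn.to_v.weight",
    "cond_stage_model.embedding.weight",
]
trainable = select_trainable(names)

rng = random.Random(0)
params = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]  # one gradient per private example
updated = dp_sgd_step(params, grads, lr=1.0, max_norm=1.0,
                      noise_multiplier=1.0, rng=rng)
```

Because noise is added once per coordinate of the trainable parameters, shrinking that set reduces the total noise injected per step, which is consistent with the abstract's motivation for fine-tuning only the attention modules.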

Paper Structure

This paper contains 29 sections, 12 equations, 13 figures, 24 tables, 1 algorithm.

Figures (13)

  • Figure 1: A schematic of DP-LDM. In the non-private step, we pre-train the autoencoder (depicted in yellow, on the left and right) with public data. We then forward pass the public data through the encoder (blue arrow on the left) to obtain latent representations, and train the diffusion model (depicted in the green box) on these lower-dimensional latent representations. The diffusion model consists of the UNet backbone and added attention modules (in red) with a conditioning embedder (in red, at the top-right corner). In the private step, we forward pass the private data (red arrow on the left) through the encoder to obtain latent representations of the private data. We then fine-tune only the red blocks, which are the attention modules and the conditioning embedder, with DP-SGD. Once training is done, we sample latent representations from the diffusion model and pass them through the decoder to obtain samples in pixel space.
  • Figure 2: (a) SpatialTransformer Block; (b) AttentionBlock
  • Figure 3: Text-to-image generation of $256 \times 256$ CelebAHQ with prompts at $\epsilon=10$. FID: 15.6
  • Figure 4: Synthetic $256 \times 256$ CelebA samples generated at varying $\epsilon$. Samples for DP-MEPF are generated from code available in DP-MEPF. We computed FID between our generated samples and the real data and achieve FIDs of $19.0 \pm 0.0$ at $\epsilon=10$, $20.5 \pm 0.1$ at $\epsilon=5$, and $25.6 \pm 0.1$ at $\epsilon=1$. DP-MEPF achieves an FID of $41.8$ at $\epsilon=10$ and $101.5$ at $\epsilon=1$.
  • Figure 5: The Q, K, V matrices in cross-attention modules in two fine-tuned models for CIFAR10 at $\epsilon=10$. Fine-tuned(0-15) indicates that the attention modules of all layers are fine-tuned, while Fine-tuned(9-15) indicates that only the attention modules at layers 9-15 are fine-tuned. (a) Each row corresponds to $W_q$ (top), $W_k$ (middle), and $W_v$ (bottom). (b) Pink represents $\Delta W_q$, blue $\Delta W_k$, and green $\Delta W_v$. Solid lines represent the model(9-15) and dotted lines represent the model(0-15).
  • ...and 8 more figures
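
Several of the captions above report FID (Fréchet Inception Distance). For reference, the standard definition, which this section does not restate, compares Gaussian fits to Inception features of real and generated images. The symbols below are introduced here for illustration: $\mu_r, \Sigma_r$ are the mean and covariance of features from real images, and $\mu_g, \Sigma_g$ those from generated images.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

Lower values indicate that the generated distribution is closer to the real one, so the reported FIDs of $19.0$ at $\epsilon=10$ and $25.6$ at $\epsilon=1$ reflect some loss in sample quality as the privacy budget tightens.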