Going beyond Compositions, DDPMs Can Produce Zero-Shot Interpolations

Justin Deschenaux, Igor Krawczuk, Grigorios Chrysos, Volkan Cevher

TL;DR

This work demonstrates that Denoising Diffusion Probabilistic Models (DDPMs) can produce zero-shot interpolations between extreme attribute values when trained on highly separated subsets of the data. The authors introduce multi-guidance sampling, combining the unconditional diffusion model with multiple attribute classifiers to steer generation toward intermediate expressions, even without intermediate training data. They validate the approach on CelebA and synthetic datasets, showing mild smiles, age transitions, and hair-color interpolations, and they explore data-efficiency, sensitivity to guidance strength, and two-attribute interpolation. The study also discusses extrapolation, limitations, and broader implications for fairness and potential misuse, highlighting diffusion priors as a source of interpolation and outlining future directions for robust, controllable generative modeling. Overall, the work expands our understanding of diffusion-model inductive biases and provides a practical sampling framework for interpolating across latent factors beyond the training distribution.

Abstract

Denoising Diffusion Probabilistic Models (DDPMs) exhibit remarkable capabilities in image generation, with studies suggesting that they can generalize by composing latent factors learned from the training data. In this work, we go further and study DDPMs trained on strictly separate subsets of the data distribution with large gaps on the support of the latent factors. We show that such a model can effectively generate images in the unexplored, intermediate regions of the distribution. For instance, when trained on clearly smiling and non-smiling faces, we demonstrate a sampling procedure which can generate slightly smiling faces without reference images (zero-shot interpolation). We replicate these findings for other attributes as well as other datasets. Our code is available at https://github.com/jdeschena/ddpm-zero-shot-interpolation.

Paper Structure (69 sections, 12 equations, 23 figures, 2 tables)

This paper contains 69 sections, 12 equations, 23 figures, 2 tables.

Figures (23)

  • Figure 1: Assume that the characteristics of real samples are influenced by some latent variables $z_i$ (e.g. color and shape). If the support of the latent space is a Cartesian product of the individual latent variables (e.g. pairs (color, shape), typically finite and discrete), we say that the samples depend on combinations of the individual latent variables. Conversely, if the latent variables $z_i$ are defined on a closed interval (e.g. $z_i \in [0, 1]$) such that any value of $z_i$ induces a meaningful sample (e.g. any color between green and pink), then we say that there exists an interpolation between extreme samples, where the extreme samples are those whose latents are close to the extrema of their support (e.g. pure pink or pure green colors).
  • Figure 2: Images from the CelebA dataset. Left: clearly non-smiling face. Center two: mild smiles. Right: clearly smiling face.
  • Figure 3: Diagram of the training and sampling process for mild smiles. The classifier and DDPM are trained on extreme examples only, i.e. the DDPM is trained on clearly smiling and clearly non-smiling faces. Nonetheless, we demonstrate that DDPMs can generate faces with mild attributes (middle) with a modified sampling scheme, despite never encountering them during training. The key to sampling mild attributes is to use the score of the classifier for both classes instead of only one, as in regular classifier-guided sampling. Importantly, we do not modify the DDPM training procedure.
  • Figure 4: Synthetic samples generated with multi-guidance using a DDPM trained on extreme images only. According to the evaluation classifier, the "Smiling" likelihood of the pictures lies in $[0.49, 0.51]$.
  • Figure 5: Empirical distribution of pictures according to the evaluation classifier. Left: extreme training examples versus samples from the unconditional diffusion model. Right: samples from the unconditional model versus images sampled with multi-guidance.
  • ...and 18 more figures
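
The multi-guidance idea described in the Figure 3 caption, using the classifier's score for both classes rather than one, can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes a hypothetical logistic classifier $p(y=1 \mid x) = \sigma(w^\top x + b)$ in place of a neural classifier on noisy images, and the names `classifier_log_grads`, `multi_guided_score`, `lam_pos`, and `lam_neg` are made up for this example. How the paper weights or schedules the two guidance terms is not stated in this section, so the additive form with two separate strengths is an assumption.

```python
import numpy as np


def classifier_log_grads(x, w, b):
    """Gradients of log p(y=1|x) and log p(y=0|x) w.r.t. x.

    Assumes a toy logistic classifier p(y=1|x) = sigmoid(w.x + b).
    Works for a single point of shape (d,) or a batch of shape (n, d).
    """
    p1 = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    p1 = np.asarray(p1)[..., None]      # add a trailing axis for broadcasting
    grad_log_p1 = (1.0 - p1) * w        # d/dx log sigmoid(z) = (1 - sigmoid(z)) * w
    grad_log_p0 = -p1 * w               # d/dx log(1 - sigmoid(z)) = -sigmoid(z) * w
    return grad_log_p1, grad_log_p0


def multi_guided_score(uncond_score, x, w, b, lam_pos=1.0, lam_neg=1.0):
    """Unconditional score plus guidance toward BOTH classes (assumed form).

    Regular classifier guidance would add only one of the two gradient terms.
    """
    grad_log_p1, grad_log_p0 = classifier_log_grads(x, w, b)
    return uncond_score + lam_pos * grad_log_p1 + lam_neg * grad_log_p0
```

For this toy classifier with equal strengths `lam_pos = lam_neg = lam`, the two guidance terms sum to `lam * (1 - 2 * p1) * w`. The combined guidance therefore vanishes where the classifier is undecided (`p1 = 0.5`) and otherwise points toward the decision boundary, which gives some intuition for why guiding toward both classes could favor intermediate samples.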

Theorems & Definitions (1)

  • Definition 2.1: wiedemer2023compositional