Variational Control for Guidance in Diffusion Models

Kushagra Pandey, Farrin Marouf Sofian, Felix Draxler, Theofanis Karaletsos, Stephan Mandt

TL;DR

This work reframes guidance in pretrained diffusion models as a variational control problem, introducing Diffusion Trajectory Matching (DTM) to steer diffusion trajectories toward a terminal constraint without retraining. By formulating a terminal cost and a transient KL divergence, and implementing greedy, stepwise optimization, the framework unifies existing guidance methods and enables non-linear control via Non-linear Diffusion Trajectory Matching (NDTM). For DDIM samplers, NDTM yields tractable bounds that balance deviation from unguided dynamics with task-specific terminal objectives, achieving state-of-the-art results on challenging inverse problems and effective style guidance with latent-space diffusion models. The approach offers a flexible, training-free mechanism to adapt pretrained diffusion priors to diverse tasks, with potential for deeper theoretical analysis and faster sampling strategies in future work.

Abstract

Diffusion models exhibit excellent sample quality, but existing guidance methods often require additional model training or are limited to specific tasks. We revisit guidance in diffusion models from the perspective of variational inference and control, introducing Diffusion Trajectory Matching (DTM) that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on several linear, non-linear, and blind inverse problems without requiring additional model training or specificity to pixel or latent space diffusion models. Our code will be available at https://github.com/czi-ai/oc-guidance

Paper Structure

This paper contains 36 sections, 2 theorems, 40 equations, 12 figures, 13 tables, 1 algorithm.

Key Result

Proposition 3.1

For the diffusion posterior parameterization in DDIM (Song et al., 2022), the NDTM objective in Eq. (eq:ndtm_cost) has the following tractable upper bound (see proof in Appendix subsec:ddim_proof), where $\bar{\mathbf{x}}_t = \mathbf{x}_t + \gamma \mathbf{u}_t$ is the guided state and the coefficients $\kappa_t = \frac{\gamma \sqrt{\alpha_{t-1}}}{\sqrt{\alpha_t}}$ and ...
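
The two quantities that survive in the statement above, the guided state $\bar{\mathbf{x}}_t = \mathbf{x}_t + \gamma \mathbf{u}_t$ and the coefficient $\kappa_t$, can be sketched numerically as follows. This is only an illustration of those two formulas, not the authors' implementation. The function names, the use of plain Python lists, and the example values are all hypothetical. It also assumes that $\alpha_t$ denotes the cumulative noise-schedule coefficient in the DDIM convention. The remaining terms of the bound are not shown in this excerpt and are not reproduced here.

```python
import math


def guided_state(x_t, u_t, gamma):
    """Return x_bar_t = x_t + gamma * u_t, element-wise (hypothetical helper)."""
    if len(x_t) != len(u_t):
        raise ValueError("x_t and u_t must have the same length")
    return [x + gamma * u for x, u in zip(x_t, u_t)]


def kappa(gamma, alpha_t, alpha_prev):
    """Return kappa_t = gamma * sqrt(alpha_{t-1}) / sqrt(alpha_t).

    Assumption: alpha_t and alpha_prev are cumulative schedule values in (0, 1].
    """
    if alpha_t <= 0 or alpha_prev < 0:
        raise ValueError("alpha_t must be positive and alpha_prev non-negative")
    return gamma * math.sqrt(alpha_prev) / math.sqrt(alpha_t)


# Illustrative values only; they are not taken from the paper.
x_bar = guided_state([1.0, 2.0], [2.0, -2.0], gamma=0.5)
k = kappa(gamma=0.5, alpha_t=0.25, alpha_prev=1.0)
print(x_bar, k)
```

If the schedule is assumed to decrease with $t$, so that $\alpha_{t-1} \geq \alpha_t$, then $\kappa_t \geq \gamma$ at every step.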

Figures (12)

  • Figure 1: Our method guides diffusion sampling to fulfill external constraints. To this end, we optimize the local direction $\mathbf{u}_t^*$ via external constraints while respecting the original trajectory, see Eq. (eq:dtm_cost_final) (left, center). This recovers more accurate reconstructions across tasks compared to classical guidance methods: Nonlinear deblurring (Right). Our method accurately captures most details, while competing methods introduce artifacts in the generated reconstructions.
  • Figure 2: NDTM outperforms competing baselines on blind image deblurring (BID) with Gaussian (top) and Motion (bottom) kernels. NDTM accurately captures most details, while competing methods introduce artifacts in the generated reconstructions.
  • Figure 3: Style Guidance with Stable Diffusion. NDTM (proposed) provides a better tradeoff between Prompt adherence and Style adherence using Stable Diffusion 1.4. While baselines can introduce random artifacts in generated samples, NDTM preserves sample quality while exhibiting better style adherence. (Top Panel) Reference Style Images. (Bottom Panel) Samples corresponding to the reference style images in the Top Panel. The CLIP (Radford et al., 2021) score (higher is better) represents the similarity between the generated image and the text prompt, indicating prompt adherence. The Style score (lower is better) represents the distance between CLIP features for the reference style and the generated image, indicating style adherence. We present additional quantitative results in Table (table:style_guidance).
  • Figure 4: Impact of different design choices in NDTM on Distortion (PSNR) and Perception (LPIPS) for the non-linear deblur task. (a, b) The extent of guidance can be jointly controlled by varying the terminal loss weight ($w_T$) and the weight ($\gamma$). (c, d) Compute vs quality can be traded off by jointly varying the number of optimization steps (N) and the number of diffusion steps.
  • Figure 5: The optimal variational controls hierarchically refine image features over time. (Top Row) Non-Linear Deblur. (Bottom Row) Random Inpainting. (Left to Right) We visualize optimal controls at different times $t_0 > t_1 > t_2 > t_3$ in diffusion sampling, progressively capturing coarse to fine details.
  • ...and 7 more figures

Theorems & Definitions (3)

  • Proposition 3.1
  • Proposition 1.1
  • proof