Training-free Diffusion Model Alignment with Sampling Demons

Po-Hung Yeh, Kuang-Huei Lee, Jun-Cheng Chen

TL;DR

This work tackles the challenge of aligning pre-trained diffusion-based image generation with user preferences without retraining. It introduces Demon, an inference-time, stochastic-noise optimization framework that guides denoising by shaping reverse-time perturbations, accommodating non-differentiable reward signals such as VLM API outputs and human judgments. Two concrete variants, Tanh Demon and Boltzmann Demon, are proposed, each with theoretical guarantees on improving the final reward over standard PF-ODE-based sampling. The authors provide a rigorous link between the reward estimate $r_\beta$ and its ODE proxy $r \circ {\mathbf{c}}$, and demonstrate substantial improvements in aesthetics and alignment across SD v1.4/XL under various reward objectives, all without backpropagation or retraining. The approach is plug-and-play and scalable, broadening the practical use of diffusion systems by leveraging non-differentiable and human-derived signals while offering a public implementation.

Abstract

Aligning diffusion models with user preferences has been a key challenge. Existing methods for aligning diffusion models either require retraining or are limited to differentiable reward functions. To address these limitations, we propose a stochastic optimization approach, dubbed Demon, to guide the denoising process at inference time without backpropagation through reward functions or model retraining. Our approach works by controlling noise distribution in denoising steps to concentrate density on regions corresponding to high rewards through stochastic optimization. We provide comprehensive theoretical and empirical evidence to support and validate our approach, including experiments that use non-differentiable sources of rewards such as Visual-Language Model (VLM) APIs and human judgements. To the best of our knowledge, the proposed approach is the first inference-time, backpropagation-free preference alignment method for diffusion models. Our method can be easily integrated with existing diffusion models without further training. Our experiments show that the proposed approach significantly improves the average aesthetics scores for text-to-image generation. Implementation is available at https://github.com/aiiu-lab/DemonSampling.

Paper Structure (62 sections, 7 theorems, 52 equations, 10 figures, 16 tables, 3 algorithms)

Key Result

Lemma 1

We have: where ${\mathbf{x}}_0$ is sampled from eq:sde_beta.

Figures (10)

  • Figure 1: Illustration of Demon. Given a reverse-time SDE for denoising and an interval $[t_{\max}, t_{\min}]$, we first discretize it into $T$ steps, $t_{\max} > \cdots > t > t - \Delta > \cdots > t_{\min}$. At every reverse-time denoising step, from $t$ to $t - \Delta$, we synthesize an "optimal" noise $\mathbf{z}^*$ from $K$ i.i.d. noises w.r.t. a given reward source and use $\mathbf{z}^*$ to seed the step. This guides the denoising process towards images that are better aligned with the reward source and the preference it represents. More details are presented in \ref{sec:demon}.
  • Figure 2: Illustration of the proximity between $r_\beta$ and $r \circ {\mathbf{c}}$. In this figure, $\beta$ is nonzero and $r$ is nearly harmonic (i.e., $\nabla^2 r \approx 0$). The red points indicate i.i.d. SDE samples, and purple indicates the ODE approximation of ${\mathbf{x}}_t$. The green line indicates the expectation of the rewards of the SDE samples (e.g., an approximate estimate, $\frac{1}{4}\sum_{i=1}^4 r({\mathbf{x}}_0^{(i)})$).
  • Figure 3: An illustration of the Tanh Demon sampling method with $K=4$. (a) An SDE step generates several samples, each determined by a sampled noise $\mathbf{z}_k$. Tanh Demon classifies each noise sample as "low-reward" or "high-reward" w.r.t. $r_\beta({\mathbf{x}}_t)$ based on its reward estimate. (b) Low-reward noises are penalized: $\tanh$ assigns them a negative weight, which is equivalent to flipping the noise. (c) The post-processed noises are averaged and projected onto the high-dimensional sphere, resulting in a feasible noise ${\mathbf{z}}^*$ with a high reward estimate.
  • Figure 4: Performance comparison of the proposed algorithm and baseline methods in terms of the number of reward queries and execution time; the dependent variable is $T$, for which larger values are suggested for SDE solvers to reduce truncation error. Although DOODL can achieve results similar to ours, it relies on reward backpropagation, whereas our methods are backpropagation-free. The shaded areas and the radii of the solid circles represent the standard deviation of the evaluation results.
  • Figure 5: We design an application for manual interaction with our algorithm. One of the authors selects images according to personal preference (non-preferred images are left unselected), aiming to align the generated image with the reference image. We evaluate performance by measuring the cosine similarity of DINOv2 features between the targeted and reference images.
  • ...and 5 more figures
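
The noise-synthesis step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `tanh_demon_noise`, the temperature `tau`, the use of the mean of the $K$ reward estimates as a stand-in for $r_\beta({\mathbf{x}}_t)$, and the fallback for the degenerate case are all hypothetical choices made for this example. It uses only the Python standard library.

```python
import math


def tanh_demon_noise(noises, reward_estimates, tau=1.0):
    """Combine K candidate noises into one, loosely following Figure 3.

    noises: list of K noise vectors (lists of floats of equal length N).
    reward_estimates: list of K reward estimates, one per candidate noise.
    tau: hypothetical temperature for the tanh argument (an assumption).
    """
    k = len(noises)
    n = len(noises[0])
    # Assumption: the mean of the K estimates stands in for r_beta(x_t).
    baseline = sum(reward_estimates) / k
    # High-reward noises get positive weights; low-reward noises get
    # negative weights, which amounts to flipping their direction.
    weights = [math.tanh((r - baseline) / tau) for r in reward_estimates]
    # Weighted average of the candidate noises.
    combined = [
        sum(w * z[i] for w, z in zip(weights, noises)) / k for i in range(n)
    ]
    norm = math.sqrt(sum(c * c for c in combined))
    if norm == 0.0:
        # Degenerate case (e.g., all estimates equal): no preferred
        # direction, so return the first candidate unchanged.
        return list(noises[0])
    # Project onto the sphere of radius sqrt(N), where standard Gaussian
    # noise concentrates in high dimensions.
    scale = math.sqrt(n) / norm
    return [c * scale for c in combined]


# Toy usage: four 2-D candidates; the reward favors the +x direction.
candidates = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
estimates = [1.0, -1.0, 0.0, 0.0]
z_star = tanh_demon_noise(candidates, estimates)
```

In this toy case the low-reward candidate `[-1.0, 0.0]` is flipped, so both informative candidates point along +x and the result lies on the circle of radius $\sqrt{2}$ in that direction. In an actual sampler, each reward estimate would come from evaluating the reward source on the sample that the corresponding noise leads to; that part is outside this sketch.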

Theorems & Definitions (18)

  • Lemma 1: Itô Integral Representation of Reward Proximity Error. Proof is in \ref{subsec:proof_ito}.
  • Lemma 2: Improvement Guarantee of Tanh Demon. Proof is in \ref{sec:tanh_proof}.
  • Lemma 1
  • proof
  • proof
  • Claim 1
  • proof
  • Lemma 2
  • Claim 2
  • proof
  • ...and 8 more