Inference-Time Alignment of Diffusion Models with Direct Noise Optimization

Zhiwei Tang, Jiangweizhi Peng, Jiasheng Tang, Mingyi Hong, Fan Wang, Tsung-Hui Chang

TL;DR

This work tackles aligning diffusion models to task-specific rewards without fine-tuning by introducing Direct Noise Optimization (DNO), which optimizes noise during sampling to maximize a reward. It provides a theoretical view that DNO yields an improved sampling distribution, and identifies OOD reward-hacking as a key risk, mitigated by probability-regularized noise optimization grounded in concentration inequalities. To handle non-differentiable rewards, the authors develop Hybrid gradient estimators and a Zeroth-Order baseline, with Hybrid-2 showing strong performance and efficiency. Empirically, DNO achieves state-of-the-art reward scores under practical time budgets across brightness, aesthetics, and other human-aligned rewards, while avoiding heavy fine-tuning and maintaining broader prompt compatibility.
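As a minimal sketch of the core idea, noise optimization can be written as gradient ascent on the noise itself. The sketch assumes a toy linear map in place of the diffusion sampler and a quadratic reward; `generate`, `reward`, the step size, and the number of steps are hypothetical stand-ins, not the paper's setup.

```python
import numpy as np

def generate(z, A, b):
    """Toy stand-in for the sampler: maps a noise vector z to a sample x."""
    return A @ z + b

def reward(x, target):
    """Toy differentiable reward: larger when x is closer to target."""
    return -np.sum((x - target) ** 2)

def reward_grad_wrt_noise(z, A, b, target):
    """Chain rule through the toy sampler: d r(generate(z)) / dz = A^T (dr/dx)."""
    x = generate(z, A, b)
    return A.T @ (-2.0 * (x - target))

def direct_noise_optimization(z0, A, b, target, step_size=0.05, num_steps=200):
    """Gradient ascent on the noise only; the 'model' (A, b) stays fixed."""
    z = z0.copy()
    for _ in range(num_steps):
        z = z + step_size * reward_grad_wrt_noise(z, A, b, target)
    return z

rng = np.random.default_rng(0)
dim = 4
A = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))  # fixed toy "model"
b = np.zeros(dim)
target = np.ones(dim)

z_init = rng.standard_normal(dim)  # initial Gaussian noise
z_opt = direct_noise_optimization(z_init, A, b, target)

print("reward before:", reward(generate(z_init, A, b), target))
print("reward after: ", reward(generate(z_opt, A, b), target))
```

Only `z` is updated here, which mirrors the tuning-free property: the parameters of the generator are never touched.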

Abstract

In this work, we focus on the alignment problem of diffusion models with a continuous reward function, which represents specific objectives for downstream tasks, such as increasing darkness or improving the aesthetics of images. The central goal of the alignment problem is to adjust the distribution learned by diffusion models such that the generated samples maximize the target reward function. We propose a novel alignment approach, named Direct Noise Optimization (DNO), that optimizes the injected noise during the sampling process of diffusion models. By design, DNO operates at inference time, and thus is tuning-free and prompt-agnostic, with the alignment occurring in an online fashion during generation. We rigorously study the theoretical properties of DNO and also propose variants to deal with non-differentiable reward functions. Furthermore, we identify that a naive implementation of DNO occasionally suffers from the out-of-distribution reward hacking problem, where optimized samples have high rewards but are no longer in the support of the pretrained distribution. To remedy this issue, we leverage classical high-dimensional statistics theory to develop an effective probability regularization technique. We conduct extensive experiments on several important reward functions and demonstrate that the proposed DNO approach can achieve state-of-the-art reward scores within a reasonable time budget for generation.
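The out-of-distribution risk can be made concrete with a standard concentration fact: for $z \sim \mathcal{N}(0, I_d)$, the ratio $\|z\|^2/d$ concentrates around 1 as $d$ grows. The sketch below uses that fact as a simple diagnostic for whether an optimized noise vector still looks Gaussian. It is an illustrative assumption rather than the paper's regularizer; `norm_deviation`, `regularized_objective`, and the penalty weight are hypothetical.

```python
import numpy as np

def norm_deviation(z):
    """Return | ||z||^2 / d - 1 |, which is near 0 for z ~ N(0, I_d) when d is large."""
    z = np.asarray(z, dtype=float).ravel()
    return abs(np.dot(z, z) / z.size - 1.0)

def regularized_objective(reward_value, z, weight=1.0):
    """Hypothetical penalized objective: reward minus a norm-deviation penalty."""
    return reward_value - weight * norm_deviation(z)

rng = np.random.default_rng(0)
d = 10_000
z_typical = rng.standard_normal(d)  # a genuine Gaussian draw
z_scaled = 3.0 * z_typical          # same direction, but an atypically large norm

print("deviation (typical):", norm_deviation(z_typical))
print("deviation (scaled): ", norm_deviation(z_scaled))
```

With equal raw rewards, the penalized objective prefers the noise vector that remains statistically plausible under the Gaussian prior.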

Paper Structure

This paper contains 28 sections, 3 theorems, 31 equations, 17 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Assuming that $r\circ M_{\theta}$ is $L$-smooth, namely, $\|\nabla r\circ M_{\theta}(z)-\nabla r\circ M_{\theta}(z')\|\leq L\|z-z'\|$ for any $z\neq z'$, it holds true that

Figures (17)

  • Figure 1: Non-cherry-picked examples of aligning Stable Diffusion XL [podell2024sdxl] with Direct Noise Optimization across four reward functions and four popular prompts from Reddit. Note that this effect is achieved without fine-tuning the diffusion model. The experiment was conducted on a single A800 GPU. More details for these examples can be found in Appendix \ref{app:details}.
  • Figure 2: Overview of the DNO procedure with the DDIM sampling algorithm: DNO optimizes only the Gaussian noise vectors $\{x_T, z_1, z_2, \ldots, z_T\}$ to maximize the reward value of a single generated sample $x_0$. To backpropagate gradients from $x_0$ to $\{x_T, z_1, z_2, \ldots, z_T\}$, we use gradient checkpointing. Note that when $\eta=0$ in DDIM sampling, there is no need to compute gradients for $z_1, \ldots, z_T$, since the generated sample $x_0$ depends only on $x_T$. When computing the gradient of $r(x_0)$ with respect to $x_0$, we can use either the ground-truth gradient $\nabla r$ or an estimated gradient $\widehat{\nabla r}$, depending on whether the reward function $r(\cdot)$ is differentiable.
  • Figure 3: Example 1: Evolution of the sample distribution of a toy diffusion model while running DNO to maximize a non-convex reward function.
  • Figure 4: ODE vs. SDE for optimization. The x-axis refers to the number of gradient ascent steps during optimization.
  • Figure 5: Examples of OOD Reward-Hacking
  • ...and 12 more figures
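
The $\eta=0$ remark in the Figure 2 caption can be checked on a toy sampler. The sketch below implements the standard DDIM update with pre-drawn noise vectors; the noise predictor `toy_eps_model`, the linear schedule, and all names are assumptions for illustration, not the paper's model.

```python
import numpy as np

def toy_eps_model(x, alpha_bar_t):
    """Hypothetical noise predictor; exact when the data distribution is N(0, I)."""
    return np.sqrt(1.0 - alpha_bar_t) * x

def ddim_sample(x_T, noises, alpha_bars, eta):
    """Run DDIM from x_T using pre-drawn noise vectors z_1..z_T.

    alpha_bars[t] is the cumulative alpha product at step t for t = 0..T,
    with alpha_bars[0] = 1 (no noise). noises[t - 1] is z_t.
    """
    x = x_T
    T = len(alpha_bars) - 1
    for t in range(T, 0, -1):
        a_t, a_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = toy_eps_model(x, a_t)
        # Predicted clean sample from the current noisy sample.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # eta scales the injected noise; eta = 0 makes the sampler deterministic.
        sigma = eta * np.sqrt((1.0 - a_prev) / (1.0 - a_t)) * np.sqrt(1.0 - a_t / a_prev)
        direction = np.sqrt(max(1.0 - a_prev - sigma**2, 0.0)) * eps
        x = np.sqrt(a_prev) * x0_pred + direction + sigma * noises[t - 1]
    return x

rng = np.random.default_rng(0)
T, dim = 10, 3
alpha_bars = np.linspace(1.0, 0.05, T + 1)
x_T = rng.standard_normal(dim)
noises_a = rng.standard_normal((T, dim))
noises_b = rng.standard_normal((T, dim))

x0_ode_a = ddim_sample(x_T, noises_a, alpha_bars, eta=0.0)
x0_ode_b = ddim_sample(x_T, noises_b, alpha_bars, eta=0.0)
x0_sde_a = ddim_sample(x_T, noises_a, alpha_bars, eta=1.0)
x0_sde_b = ddim_sample(x_T, noises_b, alpha_bars, eta=1.0)

print("eta=0, same x_0 for different z:", np.allclose(x0_ode_a, x0_ode_b))
print("eta=1, same x_0 for different z:", np.allclose(x0_sde_a, x0_sde_b))
```

With $\eta=0$ the injected noise is multiplied by zero at every step, so only $x_T$ needs a gradient; with $\eta>0$ every $z_t$ influences $x_0$ and becomes an optimization variable.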

Theorems & Definitions (7)

  • Remark 1
  • Theorem 1
  • Lemma 1 [wainwright2019high]
  • Remark 2
  • Remark 3
  • Proof of Theorem \ref{thm:improve}
  • Lemma 2: Descent Lemma [bertsekas1997nonlinear]
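
For orientation, the Descent Lemma in its textbook form is given below for a generic $L$-smooth function $f$, which here plays the role of $r\circ M_{\theta}$. This is the standard statement, not a reproduction of the paper's Lemma 2 or Theorem 1; the step size $\ell$ is notation assumed here. For any $z, z'$,

$$
f(z') \;\geq\; f(z) + \langle \nabla f(z),\, z' - z \rangle - \frac{L}{2}\,\|z' - z\|^2 .
$$

Substituting one gradient ascent step $z' = z + \ell\,\nabla f(z)$ gives

$$
f(z') \;\geq\; f(z) + \ell\left(1 - \frac{L\ell}{2}\right)\|\nabla f(z)\|^2 ,
$$

so the reward cannot decrease for any $0 < \ell < 2/L$, and for $\ell \leq 1/L$ the gain is at least $\frac{\ell}{2}\|\nabla f(z)\|^2$.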