Test-time Alignment of Diffusion Models without Reward Over-optimization

Sunwoo Kim, Minkyu Kim, Dongmin Park

TL;DR

This paper tackles the problem of aligning diffusion models to downstream rewards without incurring reward over-optimization or losing the model’s diversity. It introduces a training-free, test-time approach called SMCAlign, which samples from a tempered, reward-aware target distribution $p_{tar}(x) \propto p_{data}(x) \exp(r(x)/\alpha)$ using Sequential Monte Carlo tailored for diffusion processes. By incorporating tempered intermediate targets and a locally optimal proposal, the method achieves effective reward optimization while preserving cross-reward generalization and diversity, and it extends to single and multi-objective settings as well as online black-box optimization. The approach offers a robust, scalable alternative to fine-tuning for aligning diffusion models with diverse downstream objectives, with public code available for reproducibility.
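
As a rough illustration of the target distribution above, the sketch below draws samples from a stand-in base density and reweights them by $\exp(r(x)/\alpha)$, then resamples. It is a single-step importance-resampling toy, not the paper's diffusion-specific SMC procedure. The standard-normal base, the quadratic reward, and the helper name `resample_to_target` are all assumptions chosen so the answer can be checked analytically.

```python
import numpy as np

def resample_to_target(base_samples, reward_fn, alpha, rng):
    """Reweight base samples by exp(r(x)/alpha) and resample (hypothetical helper)."""
    log_w = reward_fn(base_samples) / alpha
    w = np.exp(log_w - log_w.max())   # subtract the max for numerical stability
    w /= w.sum()                      # self-normalized importance weights
    idx = rng.choice(len(base_samples), size=len(base_samples), p=w)
    return base_samples[idx], w

rng = np.random.default_rng(0)
base = rng.standard_normal(200_000)   # assumption: N(0, 1) stands in for p_data

def reward(x):
    # Hypothetical reward peaked at x = 1.
    return -(x - 1.0) ** 2

aligned, weights = resample_to_target(base, reward, alpha=1.0, rng=rng)

# For this toy, p_tar is Gaussian with mean 2/3 and variance 1/3,
# so the weighted mean of the base samples should be close to 2/3.
weighted_mean = float(np.sum(weights * base))
```

A smaller `alpha` sharpens the weights toward high-reward samples, while a larger `alpha` keeps the resampled set closer to the base distribution.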

Abstract

Diffusion models excel in generative tasks, but aligning them with specific objectives while maintaining their versatility remains challenging. Existing fine-tuning methods often suffer from reward over-optimization, while approximate guidance approaches fail to optimize target rewards effectively. Addressing these limitations, we propose a training-free, test-time method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution. Our approach, tailored for diffusion sampling and incorporating tempering techniques, achieves comparable or superior target rewards to fine-tuning methods while preserving diversity and cross-reward generalization. We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box optimization. This work offers a robust solution for aligning diffusion models with diverse downstream objectives without compromising their general capabilities. Code is available at https://github.com/krafton-ai/DAS.
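
The combination of SMC and tempering mentioned in the abstract can be illustrated with a generic annealed SMC sampler on a one-dimensional toy problem. This is a minimal sketch under stated assumptions, not the authors' algorithm: it uses a linear tempering schedule, ESS-triggered multinomial resampling, and a random-walk Metropolis move, and it does not involve a diffusion model or the locally optimal proposal. The base density, reward, and function name `tempered_smc` are hypothetical.

```python
import numpy as np

def log_base(x):
    # Assumption: a standard normal stands in for the pre-trained model's density.
    return -0.5 * x**2

def reward(x):
    # Hypothetical reward peaked at x = 1.
    return -(x - 1.0) ** 2

def tempered_smc(n_particles=5000, n_steps=20, alpha=1.0, step_size=0.5, seed=0):
    """Annealed SMC toward p(x) * exp(r(x)/alpha) via targets p(x) * exp(lam * r(x)/alpha)."""
    rng = np.random.default_rng(seed)
    lambdas = np.linspace(0.0, 1.0, n_steps + 1)   # linear tempering schedule (assumption)
    x = rng.standard_normal(n_particles)           # exact samples at lam = 0
    log_w = np.zeros(n_particles)
    for lam_prev, lam in zip(lambdas[:-1], lambdas[1:]):
        # Incremental weight: ratio of consecutive tempered targets.
        log_w += (lam - lam_prev) * reward(x) / alpha
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Resample only when the effective sample size drops below half.
        if 1.0 / np.sum(w**2) < 0.5 * n_particles:
            x = x[rng.choice(n_particles, size=n_particles, p=w)]
            log_w = np.zeros(n_particles)
        # One random-walk Metropolis step that leaves the current tempered target invariant.
        prop = x + step_size * rng.standard_normal(n_particles)
        log_ratio = (log_base(prop) + lam * reward(prop) / alpha) \
                  - (log_base(x) + lam * reward(x) / alpha)
        accept = np.log(rng.random(n_particles)) < log_ratio
        x = np.where(accept, prop, x)
    w = np.exp(log_w - log_w.max())
    return x, w / w.sum()

particles, weights = tempered_smc()
# For this toy the final target is Gaussian with mean 2/3 and variance 1/3.
estimate = float(np.sum(weights * particles))
```

Moving through intermediate targets keeps each reweighting step mild, which is the usual motivation for tempering: the particle weights degenerate less than they would in a single jump from the base distribution to the final target.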

Paper Structure

This paper contains 25 sections, 3 theorems, 8 equations, 6 figures, and 1 table.

Key Result

Proposition 1

Proof. See the appendix proofs.

Figures (6)

  • Figure 1: Comparison between our method and existing methods on a toy example. Left of dashed line: samples from the pre-trained model trained on a mixture of Gaussians, and the reward-aligned target distribution $p_{tar}$. Right of dashed line: methods for sampling from $p_{tar}$, including previous methods (RL, direct backpropagation, approximate guidance) and ours using SMC. Top: reward $r(X, Y) = -X^2/100-Y^2$; bottom: reward $r(X, Y) = -X^2-(Y-1)^2/10$. EMD denotes a sample estimate of the Earth Mover's Distance (also known as the Wasserstein distance) between the sample distribution produced by each method and the target distribution. Note that samples may exist outside the grid.
  • Figure 2: Target Reward vs. Evaluation Metrics. Top: target is aesthetic score; bottom: target is PickScore. (a), (e) and (b), (f): evaluation of cross-reward generalization using HPSv2 and ImageReward, respectively. (c), (g) and (d), (h): evaluation of diversity using Truncated CLIP Entropy and mean pairwise similarity calculated with LPIPS, respectively. Our method reaches similar or better target reward compared to fine-tuning methods (DDPO, AlignProp) while maintaining cross-reward generalization and diversity like guidance methods (DPS, FreeDoM, MPGD), breaking through the Pareto front of previous methods.
  • Figure 3: (caption unavailable)
  • Figure 4: Qualitative comparison of T2I alignment. Reward model: PickScore. Prompts: "A green colored rabbit", "Four wolves in the park", "cat and a dog", "A dog on the moon", "A cat in the style of Van Gogh's Starry Night", "A door that leads to outer space".
  • Figure 5: (caption unavailable)
  • ...and 1 more figure
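
The two toy rewards in the Figure 1 caption, and a sample-based EMD, can be written down in a few lines. The sketch below is one common way to estimate the EMD between two equal-size point clouds with uniform weights, by solving an optimal assignment problem with SciPy; whether the paper computes its EMD this way is not stated here, so treat the estimator and the function names as assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def reward_top(x, y):
    # Top-row reward from the Figure 1 caption: r(X, Y) = -X^2/100 - Y^2.
    return -x**2 / 100 - y**2

def reward_bottom(x, y):
    # Bottom-row reward from the Figure 1 caption: r(X, Y) = -X^2 - (Y-1)^2/10.
    return -x**2 - (y - 1) ** 2 / 10

def emd(samples_a, samples_b):
    """Sample estimate of the Earth Mover's Distance between two equal-size
    point clouds of shape (n, d), via optimal one-to-one assignment."""
    cost = cdist(samples_a, samples_b)             # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)       # minimum-cost matching
    return float(cost[rows, cols].mean())

rng = np.random.default_rng(0)
cloud = rng.standard_normal((200, 2))
shifted = cloud + np.array([1.0, 0.0])             # same cloud moved one unit along X

distance_to_self = emd(cloud, cloud)
distance_to_shifted = emd(cloud, shifted)
```

For equal-size samples with uniform weights, the optimal transport plan can be taken to be a permutation, which is why an assignment solver suffices here. The cost matrix grows quadratically with the sample count, so this estimator is only practical for modest sample sizes.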

Theorems & Definitions (3)

  • Proposition 1: Locally Optimal Proposal
  • Theorem 1: Asymptotic Exactness
  • Theorem 2: Asymptotic Variance and Sample Efficiency