Test-time Alignment of Diffusion Models without Reward Over-optimization

Sunwoo Kim; Minkyu Kim; Dongmin Park

TL;DR

This paper addresses aligning diffusion models with downstream rewards without reward over-optimization or loss of diversity. It introduces a training-free, test-time approach called SMCAlign, which samples from a tempered, reward-aware target distribution $p_{tar}(x) \propto p_{data}(x) \exp(r(x)/\alpha)$ using Sequential Monte Carlo (SMC) tailored to diffusion processes. By combining tempered intermediate targets with an approximation to the locally optimal proposal, the method optimizes the target reward effectively while preserving cross-reward generalization and diversity, and it applies to single-objective and multi-objective settings as well as online black-box optimization. The approach offers a training-free alternative to fine-tuning for aligning diffusion models with diverse downstream objectives; code is publicly available.
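One standard way to read the target distribution above is as the maximizer of a KL-regularized reward objective. The identity below is a short worked derivation using only the symbols $p_{data}$, $r$, and $\alpha$ from the TL;DR; the normalizing constant $Z$ is notation introduced here for illustration.

```latex
\begin{aligned}
\mathbb{E}_{x \sim p}[r(x)] - \alpha\,\mathrm{KL}(p \,\|\, p_{data})
  &= -\alpha\,\mathrm{KL}(p \,\|\, p_{tar}) + \alpha \log Z, \\
Z &= \int p_{data}(x)\,\exp\!\big(r(x)/\alpha\big)\,dx .
\end{aligned}
```

Since $\mathrm{KL}(p \,\|\, p_{tar}) \ge 0$ with equality only at $p = p_{tar}$, the left-hand side is maximized by $p_{tar}$. A large $\alpha$ keeps the target close to $p_{data}$, while a small $\alpha$ concentrates it on high-reward regions.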

Abstract

Diffusion models excel in generative tasks, but aligning them with specific objectives while maintaining their versatility remains challenging. Existing fine-tuning methods often suffer from reward over-optimization, while approximate guidance approaches fail to optimize target rewards effectively. Addressing these limitations, we propose a training-free, test-time method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution. Our approach, tailored for diffusion sampling and incorporating tempering techniques, achieves comparable or superior target rewards to fine-tuning methods while preserving diversity and cross-reward generalization. We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box optimization. This work offers a robust solution for aligning diffusion models with diverse downstream objectives without compromising their general capabilities. Code is available at https://github.com/krafton-ai/DAS.
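To make the idea of SMC with tempering concrete, here is a minimal sketch of a generic tempered SMC sampler on a one-dimensional toy problem. It is not the paper's diffusion-specific algorithm: the standard normal base model, the quadratic `reward`, the linear tempering schedule, and the random-walk Metropolis move are all assumptions chosen so that the target $\propto p_{data}(x)\exp(r(x)/\alpha)$ has a known Gaussian form.

```python
import numpy as np


def reward(x):
    # Hypothetical toy reward, peaked at x = 2 (an assumption for illustration).
    return -(x - 2.0) ** 2


def log_base(x):
    # Unnormalized log-density of a stand-in "pre-trained" model, N(0, 1).
    return -0.5 * x ** 2


def tempered_smc(n_particles=4000, n_steps=20, alpha=1.0, seed=0):
    """Sample approximately from p_base(x) * exp(reward(x) / alpha)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_particles)           # start from the base model
    lambdas = np.linspace(0.0, 1.0, n_steps + 1)   # tempering schedule 0 -> 1

    for lam_prev, lam in zip(lambdas[:-1], lambdas[1:]):
        # Incremental importance weight between consecutive tempered targets.
        log_w = (lam - lam_prev) * reward(x) / alpha
        w = np.exp(log_w - log_w.max())
        w /= w.sum()

        # Multinomial resampling: duplicate high-weight particles.
        x = x[rng.choice(n_particles, size=n_particles, p=w)]

        # One random-walk Metropolis step targeting the current tempered
        # distribution, which restores diversity after resampling.
        proposal = x + 0.5 * rng.standard_normal(n_particles)
        log_ratio = (
            log_base(proposal) + lam * reward(proposal) / alpha
            - log_base(x) - lam * reward(x) / alpha
        )
        accept = np.log(rng.random(n_particles)) < log_ratio
        x = np.where(accept, proposal, x)

    return x


if __name__ == "__main__":
    samples = tempered_smc()
    # For alpha = 1 the exact target is Gaussian with mean 4/3 and variance 1/3.
    print(samples.mean(), samples.var())
```

Tempering spreads the reward reweighting over many small steps, so no single resampling step collapses the particle set onto a few high-reward samples.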

Paper Structure (25 sections, 3 theorems, 8 equations, 6 figures, 1 table)

Introduction
Related Work
Fine-tuning diffusion models for alignment
Guidance
Aligning Pre-trained models and reward over-optimization
Combining Diffusion Models with Sequential Monte Carlos
Diffusion Alignment as Sampling
Aligning Diffusion Models with Rewards: Problem Setting
Limitations of Existing Methods
Tempered Diffusion Posterior Sampling for Reward Alignment SMCAlign: SMC-guided diffusion model alignment
Backward Kernel: Forward Diffusion Process
Intermediate Targets: Approximate Posterior with Tempering
Proposal: Approximating Locally Optimal Proposal
Asymptotic Behavior
Experiments
...and 10 more sections

Key Result

Proposition 1 (Locally Optimal Proposal)

Proof. See the appendix.

Figures (6)

Figure 1: Comparison between our method and existing methods on a toy example. Left of dashed line: samples from the pre-trained model (trained on a mixture of Gaussians) and the reward-aligned target distribution $p_{tar}$. Right of dashed line: methods for sampling from $p_{tar}$, including previous methods (RL, direct backpropagation, approximate guidance) and ours using SMC. Top: reward $r(X, Y) = -X^2/100-Y^2$; bottom: reward $r(X, Y) = -X^2-(Y-1)^2/10$. EMD denotes a sample estimate of the Earth Mover's Distance (also known as the Wasserstein distance) between the sample distribution produced by each method and the target distribution. Note that samples may exist outside the grid.
Figure 2: Target Reward vs. Evaluation Metrics. Top: target is aesthetic score; bottom: target is PickScore. (a), (e) and (b), (f): evaluation of cross-reward generalization using HPSv2 and ImageReward, respectively. (c), (g) and (d), (h): evaluation of diversity using Truncated CLIP Entropy and mean pairwise similarity calculated with LPIPS, respectively. Our method reaches similar or better target reward compared to fine-tuning methods (DDPO, AlignProp) while maintaining cross-reward generalization and diversity like guidance methods (DPS, FreeDoM, MPGD), breaking through the Pareto front of previous methods.
Figure 4: Qualitative comparison of T2I alignment. Reward model: PickScore. Prompts: "A green colored rabbit", "Four wolves in the park", "cat and a dog", "A dog on the moon", "A cat in the style of Van Gogh's Starry Night", "A door that leads to outer space".
...and 1 more figure
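The two toy rewards and the EMD metric from the Figure 1 caption can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the function names are hypothetical, and the EMD is computed by optimal one-to-one matching, which is exact only under the assumption that both sample sets have the same size and uniform weights.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def reward_top(x, y):
    # Top-row reward from the Figure 1 caption: r(X, Y) = -X^2/100 - Y^2.
    return -x ** 2 / 100 - y ** 2


def reward_bottom(x, y):
    # Bottom-row reward from the Figure 1 caption: r(X, Y) = -X^2 - (Y-1)^2/10.
    return -x ** 2 - (y - 1) ** 2 / 10


def emd(samples_a, samples_b):
    """Sample estimate of the Earth Mover's Distance between two point sets.

    Assumes both arrays have shape (n, d) with the same n and equal weights,
    so the optimal transport plan is a one-to-one matching.
    """
    cost = cdist(samples_a, samples_b)         # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)   # minimum-cost matching
    return cost[rows, cols].mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((200, 2))
    b = a + np.array([3.0, 0.0])               # same cloud shifted by 3 along X
    print(emd(a, b))
```

A lower EMD between a method's samples and samples from $p_{tar}$ indicates a closer match to the target distribution.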

Theorems & Definitions (3)

Proposition 1: Locally Optimal Proposal
Theorem 1: Asymptotic Exactness
Theorem 2: Asymptotic Variance and Sample Efficiency
