Diffusion Controller: Framework, Algorithms and Parameterization

Tong Yang; Moonkyung Ryu; Chih-Wei Hsu; Guy Tennenholtz; Yuejie Chi; Craig Boutilier; Bo Dai

TL;DR

The paper introduces Diffusion Controller (DiffCon), a unified control-theoretic view that casts reverse diffusion sampling as state-only stochastic control within linearly-solvable Markov Decision Processes (LS-MDPs), and derives practical reinforcement learning methods for diffusion fine-tuning from it.

Abstract

Controllable diffusion generation often relies on various heuristics that seem disconnected and lack a unified understanding. We bridge this gap with Diffusion Controller (DiffCon), a unified control-theoretic view that casts reverse diffusion sampling as state-only stochastic control within (generalized) linearly-solvable Markov Decision Processes (LS-MDPs). Under this framework, control acts by reweighting the pretrained reverse-time transition kernels, balancing terminal objectives against an $f$-divergence cost. From the resulting optimality conditions, we derive practical reinforcement learning methods for diffusion fine-tuning: (i) $f$-divergence-regularized policy-gradient updates, including a PPO-style rule, and (ii) a regularizer-determined reward-weighted regression objective with a minimizer-preservation guarantee under the Kullback-Leibler (KL) divergence. The LS-MDP framework further implies a principled model form: the optimal score decomposes into a fixed pretrained baseline plus a lightweight control correction. This motivates a side-network parameterization conditioned on exposed intermediate denoising outputs, which enables effective gray-box adaptation with a frozen backbone. Experiments on Stable Diffusion v1.4 across supervised and reward-driven fine-tuning show consistent gains in preference-alignment win rates and improved quality-efficiency trade-offs versus gray-box baselines and even the parameter-efficient white-box adapter LoRA.
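
To make the "reweighting the pretrained transition kernels" idea concrete, here is a minimal sketch of the classical KL-regularized LS-MDP recursion on a toy finite-state chain. It is not the paper's algorithm: the finite state space, the time-homogeneous base kernel `P`, the temperature `beta`, and the function name `kl_controlled_kernels` are all assumptions made for illustration, and only the KL special case of the $f$-divergence cost is covered. The desirability function `z` is propagated backward from the terminal reward, and each controlled kernel is the base kernel tilted by `z`.

```python
import numpy as np

def kl_controlled_kernels(P, terminal_reward, beta, horizon):
    """Toy KL-regularized LS-MDP (illustrative assumption, not the paper's code).

    P               : (S, S) row-stochastic base ("pretrained") kernel, P[x, x'].
    terminal_reward : (S,) reward received at the final step.
    beta            : weight of the KL cost (larger = stay closer to P).
    Returns the list of controlled kernels (step 0 first) and z at step 0.
    """
    z = np.exp(terminal_reward / beta)        # desirability at the final step
    kernels = []
    for _ in range(horizon):
        z_prev = P @ z                        # z_t(x) = sum_x' P(x'|x) z_{t+1}(x')
        pi = P * z[None, :] / z_prev[:, None] # pi_t(x'|x) = P(x'|x) z_{t+1}(x') / z_t(x)
        kernels.append(pi)
        z = z_prev
    kernels.reverse()                         # built last step first, so flip the order
    return kernels, z

# Hypothetical 3-state example: state 2 carries the terminal reward.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
reward = np.array([0.0, 0.0, 1.0])
kernels, z0 = kl_controlled_kernels(P, reward, beta=0.5, horizon=4)

# Terminal marginal of the controlled chain, for each start state.
controlled_terminal = np.linalg.multi_dot(kernels)
```

In this sketch the controlled chain's terminal marginal equals the base chain's terminal marginal tilted by `exp(reward / beta)` and renormalized, which is the discrete analogue of a pretrained baseline plus a multiplicative control correction (an additive correction in log space, hence in the score).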

Paper Structure (52 sections, 5 theorems, 120 equations, 13 figures, 11 tables)

Introduction
Our contribution
Related work
Reinforcement learning for diffusion models.
Score-function parameterization and modular control.
Notation.
Preliminaries
Linearly-Solvable MDP (LS-MDP) (Todorov, 2006).
Diffusion models (Ho et al., 2020).
Access levels for diffusion finetuning.
Diffusion Controller through LS-MDP
Framework
Our goal.
Reinforcement Learning Finetuning (RLFT)
Policy Gradient and PPO for DiffCon
...and 37 more sections

Key Result

Proposition 1

Under our reward setting (eq:reward), the gradient of $J_\theta$ is given by

Figures (13)

Figure 2: Curves of HPS-v2 win rate against the pretrained model for SFT (left), RWL (middle), and PPO (right).
Figure 3: End-of-training win-rate vs. baselines. Each subplot reports three paired comparisons: (gray-box) DiffCon vs. DiffCon-Naive; (white-box) DiffCon-J vs. LoRA; (white-box) DiffCon-S vs. LoRA. (a)-(c): HPS-v2 win rates for SFT/RWL/PPO with orange error bars showing standard deviation; (d): human-evaluated win rate for PPO.
Figure 4: Generations with different guidance strengths $\lambda_{\textnormal{model}}$ using SFT with DiffCon. Prompt: "A black cat wearing a suit and smoking a cigar".
Figure 5: Generations with different guidance strengths $\lambda_{\textnormal{model}}$ using SFT with DiffCon-J. Prompt: "A bluejay is eating spaghetti".
Figure 6: Generations with different guidance strengths $\lambda_{\textnormal{model}}$ using RWL with DiffCon. Prompt: "A crazy looking fish swimming in Alcatraz in the style of artgerm".
...and 8 more figures

Theorems & Definitions (5)

Proposition 1: policy gradient
Theorem 1: reward weighted loss
Proposition 2
Lemma 1
Lemma 2
