Diffusion Controller: Framework, Algorithms and Parameterization

Tong Yang, Moonkyung Ryu, Chih-Wei Hsu, Guy Tennenholtz, Yuejie Chi, Craig Boutilier, Bo Dai

TL;DR

The Diffusion Controller (DiffCon) is a unified control-theoretic view that casts reverse diffusion sampling as state-only stochastic control within linearly-solvable Markov Decision Processes (LS-MDPs). From this view, the authors derive practical reinforcement learning methods for diffusion fine-tuning.

Abstract

Controllable diffusion generation often relies on various heuristics that are seemingly disconnected without a unified understanding. We bridge this gap with Diffusion Controller (DiffCon), a unified control-theoretic view that casts reverse diffusion sampling as state-only stochastic control within (generalized) linearly-solvable Markov Decision Processes (LS-MDPs). Under this framework, control acts by reweighting the pretrained reverse-time transition kernels, balancing terminal objectives against an $f$-divergence cost. From the resulting optimality conditions, we derive practical reinforcement learning methods for diffusion fine-tuning: (i) $f$-divergence-regularized policy-gradient updates, including a PPO-style rule, and (ii) a regularizer-determined reward-weighted regression objective with a minimizer-preservation guarantee under the Kullback-Leibler (KL) divergence. The LS-MDP framework further implies a principled model form: the optimal score decomposes into a fixed pretrained baseline plus a lightweight control correction, motivating a side-network parameterization conditioned on exposed intermediate denoising outputs, enabling effective gray-box adaptation with a frozen backbone. Experiments on Stable Diffusion v1.4 across supervised and reward-driven fine-tuning show consistent gains in preference-alignment win rates and improved quality-efficiency trade-offs versus gray-box baselines and even the parameter-efficient white-box adapter LoRA.
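To make the kernel-reweighting idea concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm or notation: it assumes a small finite state space, a single time-independent "pretrained" kernel `P`, a terminal reward, and the KL special case of the divergence cost, for which the standard LS-MDP solution is an exponential tilt of the base kernel. All names (`tilt_kernels`, `P`, `reward`, `beta`, `T`) are hypothetical and chosen for illustration only.

```python
import math

def tilt_kernels(P, reward, beta, T):
    """Toy KL-regularized LS-MDP: reweight a base kernel P over T steps.

    Objective (assumed): maximize E[reward(x_T)] - beta * sum_t KL(pi_t || P).
    Returns (policies, values) in sampling order: policies[k] is applied at
    step k, values[k][s] is the optimal value of state s before step k, and
    values[T] equals the terminal reward.
    """
    n = len(P)
    z_next = [math.exp(r / beta) for r in reward]  # terminal desirability
    policies, zs = [], [z_next]
    for _ in range(T):
        # Desirability one step earlier: expectation of z_next under P.
        z_cur = [sum(P[s][s2] * z_next[s2] for s2 in range(n)) for s in range(n)]
        # Optimal kernel: base kernel reweighted by z_next, then normalized.
        pi = [[P[s][s2] * z_next[s2] / z_cur[s] for s2 in range(n)]
              for s in range(n)]
        policies.append(pi)
        zs.append(z_cur)
        z_next = z_cur
    policies.reverse()
    zs.reverse()
    values = [[beta * math.log(v) for v in z] for z in zs]
    return policies, values

# Hypothetical 3-state example: state 2 is the only rewarded terminal state.
P = [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3],
     [0.1, 0.2, 0.7]]
reward = [0.0, 0.0, 1.0]
beta, T = 0.5, 3
policies, values = tilt_kernels(P, reward, beta, T)
```

In this toy setting, `log pi[s][s2] - log P[s][s2]` equals `(values[k + 1][s2] - values[k][s]) / beta`, so the controlled kernel is the frozen base kernel plus an additive log-space correction. That loosely mirrors the abstract's "pretrained baseline plus control correction" decomposition, although the paper works with continuous diffusion kernels and general $f$-divergences rather than this discrete KL case.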

Paper Structure (52 sections, 5 theorems, 120 equations, 13 figures, 11 tables)

This paper contains 52 sections, 5 theorems, 120 equations, 13 figures, 11 tables.

Key Result

Proposition 1

Under our reward setting eq:reward, the gradient of $J_\theta$ is given by

Figures (13)

  • Figure 2: Curves of HPS-v2 win rate against the pretrained model for SFT (left), RWL (middle), and PPO (right).
  • Figure 3: End-of-training win-rate vs. baselines. Each subplot reports three paired comparisons: (gray-box) DiffCon vs. DiffCon-Naive; (white-box) DiffCon-J vs. LoRA; (white-box) DiffCon-S vs. LoRA. (a)-(c): HPS-v2 win rates for SFT/RWL/PPO with orange error bars showing standard deviation; (d): human-evaluated win rate for PPO.
  • Figure 4: Generations with different guidance strengths $\lambda_{\textnormal{model}}$ using SFT with DiffCon. Prompt: "A black cat wearing a suit and smoking a cigar".
  • Figure 5: Generations with different guidance strengths $\lambda_{\textnormal{model}}$ using SFT with DiffCon-J. Prompt: "A bluejay is eating spaghetti".
  • Figure 6: Generations with different guidance strengths $\lambda_{\textnormal{model}}$ using RWL with DiffCon. Prompt: "A crazy looking fish swimming in Alcatraz in the style of artgerm".
  • ...and 8 more figures

Theorems & Definitions (5)

  • Proposition 1: policy gradient
  • Theorem 1: reward weighted loss
  • Proposition 2
  • Lemma 1
  • Lemma 2