Maximize Your Diffusion: A Study into Reward Maximization and Alignment for Diffusion-based Control

Dom Huh, Prasant Mohapatra

TL;DR

This work studies reward alignment for diffusion models for control (DMC) by casting diffusion path denoising as a decision process and optimizing it to maximize returns under a data-compatibility constraint. It evaluates four alignment families (reinforcement learning, direct preference optimization, supervised fine-tuning, and cascading diffusion) and proposes a sequential unification that iteratively applies RL, DPO, and SFT, with cascading applied at inference. Across planning-based and policy-based DMC on diverse offline RL benchmarks, the approach yields higher returns and reduced variance, demonstrating improved sample efficiency and stability in offline settings. The findings suggest a practical, modular roadmap for aligning diffusion-based controllers to rewards, with online fine-tuning and parameter-efficient adapters further enhancing robustness and scalability.

Abstract

Diffusion-based planning, learning, and control methods present a promising branch of powerful and expressive decision-making solutions. Given the growing interest, such methods have undergone numerous refinements over the past years. However, despite these advancements, existing methods are limited in their investigations regarding general methods for reward maximization within the decision-making process. In this work, we study extensions of fine-tuning approaches for control applications. Specifically, we explore extensions and various design choices for four fine-tuning approaches: reward alignment through reinforcement learning, direct preference optimization, supervised fine-tuning, and cascading diffusion. We optimize their usage to merge these independent efforts into one unified paradigm. We show the utility of such propositions in offline RL settings and demonstrate empirical improvements over a rich array of control tasks.

Paper Structure

This paper contains 26 sections, 16 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: An illustration of reward maximization for a diffusion model for control, visualizing the diffusion process at $\tau \shortrightarrow \tau - 1$ such that $z^{\tau - 1}_t$ is denoised towards the target distribution $x^0_t$ while also being aligned to maximize the return distribution $G_t$, shown in red.
  • Figure 2: A visualization of the effects of alignment on the diffusion process. Given a foundation DDPM, whose outputs and diffusion field over the entire state space are shown in gray, we align its diffusion process to maximize three separate reward functions, shown in orange, green, and blue.
  • Figure 3: A visual illustration of the two DMC frameworks on the Walker2D task, where the planning-based DMC forecasts future states and the policy-based DMC generates actions directly.
  • Figure 4: Visualization of the Nav1D task with an example agent trajectory shown in green and the action space shown in gray. The reward distributions at every time step are shown in blue.
  • Figure 5: Alignment learning curve for Diffuser (DDPM) on the D4RL Medium datasets, where the average and $\pm1$ standard deviation are shown over 64 episode seeds. The yellow vertical line indicates when online fine-tuning is introduced.
  • ...and 1 more figure