Diffusion Guidance Is a Controllable Policy Improvement Operator

Kevin Frans, Seohong Park, Pieter Abbeel, Sergey Levine

TL;DR

CFGRL addresses offline RL and goal-conditioned behavioral cloning by deriving a diffusion-guided, controllable policy-improvement operator that treats policies as products of a reference and an optimality term, $\pi(a|s) \propto \hat{\pi}(a|s)\; f(A(s,a))$, with test-time control via a weight $w$. It leverages classifier-free guidance to implement the optimality conditioning without training a separate value predictor, and trains a diffusion network via flow matching to sample from the product policy. Theoretical guarantees show that, when $f$ is non-negative and non-decreasing in the advantage, the product policy improves over the reference, and increasing $w$ yields further improvement, trading off adherence to the data against policy improvement. Empirically, CFGRL outperforms standard weighted regression methods in offline RL and provides substantial gains over GCBC across state- and image-based tasks without value-function training, demonstrating the practicality and scalability of diffusion-guided policy optimization. Overall, CFGRL offers a simple, test-time-tunable, drop-in tool for policy extraction and improvement that can be integrated into existing RL pipelines.
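
The product-policy claim can be illustrated on a small discrete action set. The sketch below is not the paper's implementation: it assumes that $w$ enters as an exponent on the optimality term, so that $\pi_w(a) \propto \hat{\pi}(a)\, f(A(a))^w$, and it picks $f = \exp$ as one non-negative, non-decreasing choice. The function names, the three-action reference policy, and the advantage values are all hypothetical.

```python
import math

def product_policy(ref_probs, advantages, w, f=math.exp):
    """Return pi_w(a) proportional to ref(a) * f(A(a))**w (assumed form)."""
    weights = [p * f(a) ** w for p, a in zip(ref_probs, advantages)]
    total = sum(weights)
    return [x / total for x in weights]

def expected_advantage(probs, advantages):
    """Expected advantage of a discrete policy."""
    return sum(p * a for p, a in zip(probs, advantages))

# Hypothetical reference policy over three actions. The advantages are
# chosen so that their mean under the reference policy is zero.
ref_probs = [0.5, 0.3, 0.2]
advantages = [-0.4, 0.0, 1.0]

# Sweep the weight: w = 0 recovers the reference policy, and larger w
# shifts probability mass toward higher-advantage actions.
weights_to_try = [0.0, 1.0, 2.0, 4.0]
sweep = [
    expected_advantage(product_policy(ref_probs, advantages, w), advantages)
    for w in weights_to_try
]

for w, value in zip(weights_to_try, sweep):
    print(f"w={w:.1f}  expected advantage={value:.4f}")
```

In this toy setting the expected advantage does not decrease as $w$ grows, which mirrors the trade-off described above: a larger weight moves further from the reference distribution in exchange for more improvement.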

Abstract

At the core of reinforcement learning is the idea of learning beyond the performance in the data. However, scaling such systems has proven notoriously tricky. In contrast, techniques from generative modeling have proven remarkably scalable and are simple to train. In this work, we combine these strengths, by deriving a direct relation between policy improvement and guidance of diffusion models. The resulting framework, CFGRL, is trained with the simplicity of supervised learning, yet can further improve on the policies in the data. On offline RL tasks, we observe a reliable trend -- increased guidance weighting leads to increased performance. Of particular importance, CFGRL can operate without explicitly learning a value function, allowing us to generalize simple supervised methods (e.g., goal-conditioned behavioral cloning) to further prioritize optimality, gaining performance for "free" across the board.

Paper Structure

This paper contains 14 sections, 5 theorems, 31 equations, 8 figures, 8 tables, 2 algorithms.

Key Result

Lemma 1

For any probability measure $\mu$ on ${\mathbb{R}}$ and any bounded, measurable, non-decreasing functions $g, h: {\mathbb{R}} \to {\mathbb{R}}$,

$$\int g\,h \, d\mu \;\ge\; \left(\int g \, d\mu\right)\left(\int h \, d\mu\right).$$
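
A quick numerical check can make the lemma concrete. The sketch below assumes the lemma takes the standard form of Chebyshev's sum inequality, $\mathbb{E}_\mu[g h] \ge \mathbb{E}_\mu[g]\,\mathbb{E}_\mu[h]$, and evaluates both sides on a hypothetical five-point probability measure with two bounded, non-decreasing functions chosen purely for illustration.

```python
import math

def expect(points, probs, fn):
    """Expectation of fn under a discrete probability measure."""
    return sum(p * fn(x) for x, p in zip(points, probs))

# Hypothetical discrete measure on the real line (probabilities sum to 1).
points = [-2.0, -0.5, 0.0, 1.0, 3.0]
probs = [0.1, 0.2, 0.3, 0.25, 0.15]

# Two bounded, non-decreasing functions (illustrative choices).
g = math.tanh

def h(x):
    """Clip x to [-1, 1]; bounded and non-decreasing."""
    return min(max(x, -1.0), 1.0)

lhs = expect(points, probs, lambda x: g(x) * h(x))        # E[g h]
rhs = expect(points, probs, g) * expect(points, probs, h)  # E[g] E[h]
gap = lhs - rhs

print(f"E[gh] = {lhs:.6f}, E[g]E[h] = {rhs:.6f}, gap = {gap:.6f}")
```

The gap is the covariance of $g$ and $h$ under $\mu$, so the inequality says that two non-decreasing functions of the same variable are never negatively correlated.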

Figures (8)

  • Figure 1: While conditioning on optimality can create a baseline level of improvement, policies can be further improved by attenuating this conditioning. When $p(o \mid s,a)$ is proportional to a monotonically increasing function of advantage, attenuation provably increases expected return, and this can be accomplished naturally with diffusion guidance.
  • Figure 2: Weighted regression methods result in uneven gradient magnitudes within a batch. This can limit the effective signal that each batch provides. In contrast, CFGRL uses a simple conditional diffusion modeling loss with even weighting.
  • Figure 3: CFGRL controls the tradeoff between reference adherence and optimality by adjusting the guidance weighting. This addresses the same motivation behind tuning the temperature in advantage-weighted regression; however, it can be tuned at test time rather than via retraining, and empirically leads to a higher maximum performance.
  • Figure 4: CFGRL can extrapolate beyond the GCBC policy, unlocking further performance gains. In fact, GCBC is implicitly a special case of the CFGRL policy where $w=1$. By instead considering $w>1$, the resulting policy is an improvement over the original. We show that performance steadily increases with $w$ on a range of environments.
  • Figure 5: OGBench environments.
  • ...and 3 more figures
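
The guidance weighting that appears in these captions can be sketched with the standard classifier-free guidance rule, $v_w = v_{\text{uncond}} + w\,(v_{\text{cond}} - v_{\text{uncond}})$, applied at each step of an Euler sampler. This is a minimal illustration rather than the authors' code: `toy_velocity` is a hypothetical closed-form stand-in for a learned flow-matching network, and the targets and step count are arbitrary.

```python
def guided_velocity(v_uncond, v_cond, w):
    """Classifier-free guidance: w=0 is unconditional, w=1 is conditional,
    and w>1 extrapolates past the conditional prediction."""
    return [u + w * (c - u) for u, c in zip(v_uncond, v_cond)]

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned network: the exact velocity of a
    straight-line flow that carries every point to a fixed target at t=1."""
    return [(m - xi) / (1.0 - t) for xi, m in zip(x, target)]

def sample_action(x0, w, uncond_target, cond_target, steps=10):
    """Integrate the guided velocity field from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never reaches zero
        v_u = toy_velocity(x, t, uncond_target)
        v_c = toy_velocity(x, t, cond_target)
        v = guided_velocity(v_u, v_c, w)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical 2-D action space: the unconditional flow targets the origin,
# and the conditional (e.g. goal- or optimality-conditioned) flow targets
# a different point.
noise = [0.3, -0.7]
uncond_target = [0.0, 0.0]
cond_target = [1.0, 0.5]

action_w1 = sample_action(noise, 1.0, uncond_target, cond_target)
action_w2 = sample_action(noise, 2.0, uncond_target, cond_target)
print("w=1:", action_w1)
print("w=2:", action_w2)
```

With $w=1$ the sampler follows the conditional field alone, which corresponds to the plain conditional policy (GCBC in the goal-conditioned setting), while $w>1$ pushes samples further along the direction that separates conditional from unconditional predictions. Because $w$ is only used at sampling time, it can be changed without retraining.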

Theorems & Definitions (13)

  • Remark 1: Improvement of product policies
  • Remark 2: Further improvement via attenuation
  • Remark 3: KL-regularized reward-maximization results in product policies [awr_peng2019]
  • Lemma 1: Chebyshev's sum inequality for probability measures
  • proof
  • Lemma 2
  • proof
  • Lemma 3: Policy improvement theorem for stochastic policies [rl_sutton2005, rl_silva2023]
  • proof
  • Theorem 1: Policy improvement by reweighting
  • ...and 3 more