Diffusion Guidance Is a Controllable Policy Improvement Operator

Kevin Frans; Seohong Park; Pieter Abbeel; Sergey Levine

TL;DR

CFGRL addresses offline RL and goal-conditioned behavioral cloning by deriving a diffusion-guided, controllable policy-improvement operator. It treats the policy as the product of a reference policy and an optimality term, $\pi(a|s) \propto \hat{\pi}(a|s)\; f(A(s,a))$, and exposes a guidance weight $w$ that can be adjusted at test time. Classifier-free guidance implements the optimality conditioning without training a separate value predictor, and a diffusion network trained via flow matching is used to sample from the product policy. Theoretical guarantees show that, when $f$ is non-negative and non-decreasing in the advantage, the product policy improves over the reference, and increasing $w$ yields further improvement, trading adherence to the data against policy improvement. Empirically, CFGRL outperforms standard weighted regression methods in offline RL and provides substantial gains over GCBC across state- and image-based tasks without value-function training. Overall, CFGRL offers a simple, test-time-tunable, drop-in tool for policy extraction and improvement that can be integrated into existing RL pipelines.
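The improvement claim can be checked numerically on a toy problem. The sketch below is not the paper's implementation: it assumes a single state with three discrete actions, assumes $f = \exp$ (one choice that is non-negative and non-decreasing), and assumes the weight enters as an exponent, $\pi_w \propto \hat{\pi}\, f(A)^w$. The names `product_policy` and `expected_advantage` and all numbers are hypothetical.

```python
import numpy as np

def product_policy(ref_probs, advantages, w, f=np.exp):
    """Reweight a reference policy by f(A)^w and renormalize.

    Assumption: w enters as an exponent on the optimality term.
    """
    weights = ref_probs * f(advantages) ** w
    return weights / weights.sum()

def expected_advantage(probs, advantages):
    """Expected advantage of a policy, measured with the reference's advantages."""
    return float(np.dot(probs, advantages))

# Hypothetical reference policy and raw action scores for one state.
ref_probs = np.array([0.5, 0.3, 0.2])
scores = np.array([-1.0, 0.0, 2.0])

# Center the scores so the reference policy has zero expected advantage.
advantages = scores - np.dot(ref_probs, scores)

guidance_weights = [0.0, 0.5, 1.0, 2.0, 4.0]
values = [
    expected_advantage(product_policy(ref_probs, advantages, w), advantages)
    for w in guidance_weights
]

for w, v in zip(guidance_weights, values):
    print(f"w={w:.1f}  expected advantage={v:.4f}")
```

With $w = 0$ the product policy equals the reference, and the printed expected advantage does not decrease as $w$ grows, at the cost of moving probability mass away from the reference distribution.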

Abstract

At the core of reinforcement learning is the idea of learning beyond the performance in the data. However, scaling such systems has proven notoriously tricky. In contrast, techniques from generative modeling have proven remarkably scalable and are simple to train. In this work, we combine these strengths by deriving a direct relation between policy improvement and guidance of diffusion models. The resulting framework, CFGRL, is trained with the simplicity of supervised learning, yet can further improve on the policies in the data. On offline RL tasks, we observe a reliable trend: increased guidance weighting leads to increased performance. Of particular importance, CFGRL can operate without explicitly learning a value function, allowing us to generalize simple supervised methods (e.g., goal-conditioned behavioral cloning) to further prioritize optimality, gaining performance for "free" across the board.
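To make the role of the guidance weight concrete, the sketch below applies the standard classifier-free guidance combination, $v = v_{\text{uncond}} + w\,(v_{\text{cond}} - v_{\text{uncond}})$, inside a plain Euler sampler. It is an illustration under assumptions rather than the authors' code: the two velocity functions are hypothetical linear stand-ins for a trained flow-matching network evaluated without and with the optimality condition, and `guided_velocity` and `sample_action` are names introduced here.

```python
import numpy as np

def guided_velocity(v_uncond, v_cond, a, t, w):
    """Classifier-free guidance: extrapolate from the unconditional velocity
    toward the conditional one by a factor w."""
    base = v_uncond(a, t)
    return base + w * (v_cond(a, t) - base)

def sample_action(v_uncond, v_cond, a0, w, num_steps=10):
    """Integrate the guided velocity field from t=0 to t=1 with Euler steps."""
    a = np.asarray(a0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        a = a + dt * guided_velocity(v_uncond, v_cond, a, i * dt, w)
    return a

# Hypothetical stand-ins for a trained network: the unconditional field pulls
# actions toward 0.0 (the data mean), the conditional field toward 1.0
# (actions labeled as optimal).
toy_uncond = lambda a, t: 0.0 - a
toy_cond = lambda a, t: 1.0 - a

a0 = np.array([0.3])  # fixed starting noise so runs are comparable
finals = [sample_action(toy_uncond, toy_cond, a0, w)[0] for w in (0.0, 1.0, 3.0)]
print(finals)
```

Setting `w = 0.0` reproduces the unconditional sampler, `w = 1.0` follows the conditional field, and larger values push samples further in the direction that separates conditional from unconditional behavior. In this toy setup the weight is only a sampling-time argument, so it can be changed without retraining.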
