Diffusion Guidance Is a Controllable Policy Improvement Operator
Kevin Frans, Seohong Park, Pieter Abbeel, Sergey Levine
TL;DR
CFGRL addresses offline RL and goal-conditioned behavioral cloning (GCBC) by deriving a controllable policy-improvement operator based on diffusion guidance. The improved policy is expressed as a product of a reference policy and an optimality term, $\pi(a|s) \propto \hat{\pi}(a|s)\; f(A(s,a))$, and a guidance weight $w$ controls the strength of the optimality term at test time. The method uses classifier-free guidance to implement the optimality conditioning without training a separate value predictor, and trains a diffusion network via flow matching to sample from the product policy. The analysis shows that, when $f$ is non-negative and non-decreasing in the advantage, the product policy improves over the reference, and that increasing $w$ yields further improvement, trading adherence to the data against policy improvement. Empirically, CFGRL outperforms standard weighted regression methods in offline RL and provides substantial gains over GCBC across state- and image-based tasks without value-function training. Overall, CFGRL offers a simple, test-time-tunable, drop-in tool for policy extraction and improvement that can be integrated into existing RL pipelines.
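The product-policy claim can be checked numerically on a toy problem. The sketch below is illustrative and not taken from the paper: it assumes a single state with three discrete actions, made-up advantages that average to zero under the reference policy, $f = \exp$, and a weight $w$ that enters as an exponent on the optimality term ($\pi \propto \hat{\pi}\, f(A)^w$). The names `product_policy`, `ref_probs`, and `advantages` are hypothetical.

```python
import numpy as np

def product_policy(ref_probs, advantages, w, f=np.exp):
    """Return pi(a) proportional to ref(a) * f(A(a))**w (assumed form of the weighting)."""
    weights = ref_probs * f(advantages) ** w
    return weights / weights.sum()

# Toy single-state problem (made-up numbers).
ref_probs = np.array([0.5, 0.3, 0.2])      # reference policy
advantages = np.array([-1.0, 0.5, 1.75])   # mean zero under ref_probs

expected_adv = []
for w in [0.0, 1.0, 2.0, 4.0]:
    pi = product_policy(ref_probs, advantages, w)
    # Expected advantage of the reweighted policy, measured against the reference.
    expected_adv.append(float(pi @ advantages))
    print(f"w={w:.1f}  pi={np.round(pi, 3)}  E_pi[A]={expected_adv[-1]:.3f}")
```

With $w = 0$ the product policy equals the reference and its expected advantage is zero. As $w$ grows, probability mass shifts toward high-advantage actions and the expected advantage rises, at the cost of moving further from the reference distribution.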
Abstract
At the core of reinforcement learning is the idea of learning beyond the performance in the data. However, scaling such systems has proven notoriously tricky. In contrast, techniques from generative modeling have proven remarkably scalable and are simple to train. In this work, we combine these strengths by deriving a direct relation between policy improvement and guidance of diffusion models. The resulting framework, CFGRL, is trained with the simplicity of supervised learning, yet can further improve on the policies in the data. On offline RL tasks, we observe a reliable trend: increased guidance weighting leads to increased performance. Of particular importance, CFGRL can operate without explicitly learning a value function, allowing us to generalize simple supervised methods (e.g., goal-conditioned behavioral cloning) to further prioritize optimality, gaining performance for "free" across the board.
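The relation between guidance and a weighted product policy can be illustrated with the standard classifier-free guidance combination, `uncond + w * (cond - uncond)`. The sketch below is a toy check under stated assumptions rather than the paper's implementation: it uses analytic scores (gradients of log-density) of one-dimensional Gaussians in place of a trained network, with a reference policy $\mathcal{N}(0, 1)$ and an optimality-conditioned policy proportional to $\mathcal{N}(0, 1)\exp(a)$, which is $\mathcal{N}(1, 1)$. The function names are hypothetical.

```python
import numpy as np

def guided(uncond, cond, w):
    """Classifier-free guidance combination of two model outputs."""
    return uncond + w * (cond - uncond)

def score_ref(a):
    # Score of the reference policy N(0, 1).
    return -a

def score_cond(a):
    # Score of the optimality-conditioned policy N(1, 1),
    # i.e. the density proportional to N(0, 1) * exp(a).
    return -(a - 1.0)

a = np.linspace(-3.0, 3.0, 7)
w = 3.0
guided_score = guided(score_ref(a), score_cond(a), w)

# Score of N(w, 1), the density proportional to N(0, 1) * exp(a) ** w.
target_score = -(a - w)
print(np.allclose(guided_score, target_score))
```

In this toy case, the guided score coincides with the score of the reference density multiplied by the optimality term raised to the power $w$; setting $w = 0$ recovers the reference and $w = 1$ recovers the conditioned policy.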
