Value function estimation using conditional diffusion models for control
Bogdan Mazoure, Walter Talbott, Miguel Angel Bautista, Devon Hjelm, Alexander Toshev, Josh Susskind
TL;DR
This work introduces Diffused Value Function (DVF), a diffusion-model-based approach to estimate and optimize value functions from state sequences without relying on reward or action labels during pretraining. DVF factorizes the value into a state-occupancy component, a reward predictor, and a policy representation, enabling zero-shot evaluation and policy improvement by sampling future states from a learned diffusion model conditioned on the policy. By avoiding high-dimensional autoregressive rollouts and leveraging a one-step Bellman backup, DVF remains scalable to long-horizon tasks and can operate in reward-free or offline data settings, matching or exceeding offline baselines on robotic benchmarks. The results demonstrate DVF’s ability to capture long-horizon dynamics, generate coherent trajectories, and support efficient policy decoding, suggesting a practical path toward leveraging large volumes of imperfect data in robotics.
Abstract
A fairly reliable trend in deep reinforcement learning is that performance scales with the number of parameters, provided a complementary scaling in the amount of training data. As the appetite for large models increases, it is imperative to address, sooner rather than later, the potential problem of running out of high-quality demonstrations. In this case, instead of collecting only new data via costly human demonstrations or risking a simulation-to-real transfer with uncertain effects, it would be beneficial to leverage vast amounts of readily available low-quality data. Since classical control algorithms such as behavior cloning or temporal difference learning cannot be used on reward-free or action-free data out-of-the-box, this solution warrants novel training paradigms for continuous control. We propose a simple algorithm called Diffused Value Function (DVF), which learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model. This model can be efficiently learned from state sequences (i.e., without access to reward functions or actions), and subsequently used to estimate the value of each action out-of-the-box. We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers, and show promising qualitative and quantitative results on challenging robotics benchmarks.
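The value estimate described above rests on a standard identity: for a fixed policy, the value of a state equals the expected reward under the discounted state-occupancy measure, scaled by 1/(1 - gamma). The Python sketch below illustrates that identity on a toy three-state Markov chain; it is not the authors' implementation. The transition matrix, the rewards, and the names `discounted_occupancy`, `sample_future_state`, `reward_fn`, and `estimate_value` are all hypothetical, and the closed-form occupancy sampler merely stands in for the learned conditional diffusion model. The sketch also assumes the occupancy measure includes the current state (t = 0), which may differ from the paper's convention.

```python
import numpy as np


def discounted_occupancy(P, gamma):
    """Closed-form rho(s' | s) = (1 - gamma) * sum_{t>=0} gamma^t P^t (tabular chain)."""
    n = P.shape[0]
    return (1.0 - gamma) * np.linalg.inv(np.eye(n) - gamma * P)


def estimate_value(sample_future_state, reward_fn, state, gamma, n_samples, rng):
    """Monte Carlo estimate V(s) ~= mean(r(s+)) / (1 - gamma), with s+ ~ rho(. | s)."""
    futures = sample_future_state(state, n_samples, rng)
    return reward_fn(futures).mean() / (1.0 - gamma)


# Toy chain induced by one fixed policy (hypothetical numbers).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.0, 1.0])  # reward depends on the state only
gamma = 0.9

rho = discounted_occupancy(P, gamma)


def sample_future_state(state, n, rng):
    # Stand-in for the learned diffusion sampler: draws future states
    # directly from the exact occupancy row instead of a trained model.
    return rng.choice(len(r), size=n, p=rho[state])


def reward_fn(states):
    # Stand-in for a learned reward predictor.
    return r[states]


rng = np.random.default_rng(0)

# Reference values from the Bellman equation V = r + gamma * P V.
v_exact = np.linalg.solve(np.eye(3) - gamma * P, r)

# Sample-based values: no multi-step rollout, only draws of future states.
v_mc = np.array([
    estimate_value(sample_future_state, reward_fn, s, gamma, 20000, rng)
    for s in range(3)
])
```

In this toy setting `v_mc` approaches `v_exact` as the number of samples grows. Swapping `sample_future_state` for a generative model trained on state sequences, and `reward_fn` for a predictor fitted once rewards become available, gives the general shape of the approach the abstract describes.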
