Value function estimation using conditional diffusion models for control

Bogdan Mazoure, Walter Talbott, Miguel Angel Bautista, Devon Hjelm, Alexander Toshev, Josh Susskind

TL;DR

This work introduces Diffused Value Function (DVF), a diffusion-model-based approach to estimate and optimize value functions from state sequences without relying on reward or action labels during pretraining. DVF factorizes the value into a state-occupancy component, a reward predictor, and a policy representation, enabling zero-shot evaluation and policy improvement by sampling future states from a learned diffusion model conditioned on the policy. By avoiding high-dimensional autoregressive rollouts and leveraging a one-step Bellman backup, DVF remains scalable to long-horizon tasks and can operate in reward-free or offline data settings, matching or exceeding offline baselines on robotic benchmarks. The results demonstrate DVF’s ability to capture long-horizon dynamics, generate coherent trajectories, and support efficient policy decoding, suggesting a practical path toward leveraging large volumes of imperfect data in robotics.
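The factorization described above rests on a standard identity: when rewards depend only on the state, the value equals 1/(1 − γ) times the expected reward under the normalized discounted state-occupancy measure. The following is a minimal sketch of that Monte-Carlo estimate, not the authors' implementation. The names `sample_future_states`, `reward_model`, and `estimate_value` are hypothetical. The learned diffusion sampler is replaced here by a toy deterministic chain so the estimate can be checked against a closed form.

```python
import numpy as np

def sample_future_states(s, gamma, n, rng):
    """Toy stand-in for the diffusion sampler (hypothetical).

    Dynamics: s_{k+1} = s_k + 1. A draw from the normalized discounted
    occupancy is s + dt with P(dt = k) = (1 - gamma) * gamma**(k - 1), k >= 1.
    """
    dt = rng.geometric(1.0 - gamma, size=n)  # support {1, 2, ...}
    return s + dt

def reward_model(states, goal=5):
    """Toy reward predictor (hypothetical): 1 at or beyond the goal, else 0."""
    return (states >= goal).astype(float)

def estimate_value(s, gamma, n=200_000, seed=0):
    """Monte-Carlo estimate: V(s) ~= mean(r(s+)) / (1 - gamma)."""
    rng = np.random.default_rng(seed)
    futures = sample_future_states(s, gamma, n, rng)
    return reward_model(futures).mean() / (1.0 - gamma)

gamma = 0.9
v_hat = estimate_value(s=0, gamma=gamma)
# Closed form for this toy chain: sum_{k>=5} gamma**(k-1) = gamma**4 / (1 - gamma)
v_true = gamma**4 / (1.0 - gamma)
print(v_hat, v_true)
```

In this sketch no multi-step rollout is unrolled: each sample is a single future state, which is the property the summary credits for scalability to long horizons.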

Abstract

A fairly reliable trend in deep reinforcement learning is that performance scales with the number of parameters, provided there is a complementary scaling in the amount of training data. As the appetite for large models increases, it is imperative to address, sooner rather than later, the potential problem of running out of high-quality demonstrations. In this case, instead of collecting only new data via costly human demonstrations or risking a simulation-to-real transfer with uncertain effects, it would be beneficial to leverage vast amounts of readily available low-quality data. Since classical control algorithms such as behavior cloning or temporal difference learning cannot be used on reward-free or action-free data out of the box, this solution warrants novel training paradigms for continuous control. We propose a simple algorithm called Diffused Value Function (DVF), which learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model. This model can be efficiently learned from state sequences (i.e., without access to reward functions or actions), and subsequently used to estimate the value of each action out of the box. We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers, and show promising qualitative and quantitative results on challenging robotics benchmarks.

Paper Structure (21 sections, 13 equations, 6 figures, 2 tables, 1 algorithm)

Figures (6)

  • Figure 1: The three crucial components of DVF: (left) construct tuples $(s_t,s_{t+1},s_{t+\Delta t})$ for training the diffusion model; (middle) architecture of the diffusion model, which takes in the noisy future state $x$, current state $s_t$, time offset $\Delta t$, policy embedding $\phi(\pi)$, and diffusion timestep $t_d$, and processes them using the Perceiver I/O architecture (Jaegle et al., 2021) to predict the noise; (right) a sampling mechanism based on DDPM (Ho et al., 2020) is used with a reward model to estimate the value function.
  • Figure 2: (Left) Pairwise plot of normalized returns versus the value function estimated by DVF; (middle) pairwise plot of normalized value function versus normalized reward at the future state; (right) normalized value function and normalized environment returns versus training gradient steps.
  • Figure 3: (a, c) Ground truth data distribution for the u-maze and large maze from the Maze 2d environment. (b, d) Conditional distribution of future states $s_{t+\Delta t}|s_0,\phi(\pi_i)$ given the starting state in the bottom left corner and the policy index. The diffusion model correctly identifies and separates the three state distributions in both mazes.
  • Figure 4: Samples from the learned diffusion model with increasing values of discount factor $\gamma$, with a starting state in the lower left of the maze. As $\gamma$ increases, the model generates samples further along the trajectory leading to the furthest point of the maze. Ground truth data is shown in Figure 3(a).
  • Figure 5: Normalized returns obtained by DVF, behavior cloning, and CQL on 4 challenging robotic tasks from the PyBullet offline suite, together with the average returns in each dataset (labeled "Data" in the plot).
  • ...and 1 more figures
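
The tuple construction in Figure 1 (left) and the effect of $\gamma$ described in Figure 4 can be sketched together. This is a minimal sketch under two assumptions that are not stated in the captions: the offset $\Delta t$ is drawn from a geometric distribution with parameter $1-\gamma$, and it is clipped to the end of the trajectory. The function name `make_tuples` is illustrative only.

```python
import numpy as np

def make_tuples(traj, gamma, rng):
    """Build (s_t, s_{t+1}, s_{t+dt}, dt) tuples from one state sequence.

    traj: array of shape (T, state_dim). No actions or rewards are needed.
    Assumption: dt ~ Geometric(1 - gamma), clipped so t + dt stays in range.
    """
    T = len(traj)
    t = np.arange(T - 1)                        # every step with a successor
    dt = rng.geometric(1.0 - gamma, size=T - 1)
    dt = np.minimum(dt, T - 1 - t)              # clip to the last state
    return traj[t], traj[t + 1], traj[t + dt], dt

rng = np.random.default_rng(0)
traj = np.linspace(0.0, 1.0, 50)[:, None]       # toy 1-D trajectory
for gamma in (0.5, 0.9, 0.99):
    s_t, s_next, s_future, dt = make_tuples(traj, gamma, rng)
    print(gamma, dt.mean())                     # mean offset for this gamma
```

Under this assumption, a larger $\gamma$ places more probability on distant offsets, which is consistent with the Figure 4 description of samples moving further along the trajectory as $\gamma$ increases.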