Policy-Guided Diffusion

Matthew Thomas Jackson, Michael Tryfan Matthews, Cong Lu, Benjamin Ellis, Shimon Whiteson, Jakob Foerster

TL;DR

Policy-guided diffusion (PGD) tackles distribution shift in offline RL by training a diffusion model on the offline data to generate full trajectories, then steering samples toward the target policy with guidance from that policy. It introduces a behavior-regularized target distribution, which it approximates by adding target-policy guidance to the diffusion score while omitting direct behavior-policy gradients, so that generated trajectories stay supported by the data. Empirically, PGD consistently improves the performance of TD3+BC and IQL across MuJoCo and Maze2d datasets, and shows lower dynamics error than autoregressive model-based baselines while achieving comparable target-policy likelihood. The approach offers a practical, drop-in data-generation mechanism that reduces reliance on conservative learning and enables controllable, on-policy synthetic experience for offline RL.

Abstract

In many real-world settings, agents must learn from an offline dataset gathered by some prior behavior policy. Such a setting naturally leads to distribution shift between the behavior policy and the target policy being trained, requiring policy conservatism to avoid instability and overestimation bias. Autoregressive world models offer a different solution to this by generating synthetic, on-policy experience. However, in practice, model rollouts must be severely truncated to avoid compounding error. As an alternative, we propose policy-guided diffusion. Our method uses diffusion models to generate entire trajectories under the behavior distribution, applying guidance from the target policy to move synthetic experience further on-policy. We show that policy-guided diffusion models a regularized form of the target distribution that balances action likelihood under both the target and behavior policies, leading to plausible trajectories with high target policy probability, while retaining a lower dynamics error than an offline world model baseline. Using synthetic experience from policy-guided diffusion as a drop-in substitute for real data, we demonstrate significant improvements in performance across a range of standard offline reinforcement learning algorithms and environments. Our approach provides an effective alternative to autoregressive offline world models, opening the door to the controllable generation of synthetic training data.
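
The guidance idea in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the paper's implementation: it assumes that both the behavior distribution and the target policy are Gaussians over a single action value, ignores diffusion noise levels entirely, and uses hypothetical names (`guided_score`, `guided_mode`, `ascend`) and a hypothetical guidance coefficient `lam`. Under those assumptions, adding `lam` times the target-policy score to the behavior score gives the score of the product of the behavior density and the target density raised to the power `lam`, whose mode can be checked in closed form.

```python
# Toy 1D illustration (an assumption, not the paper's implementation):
# behavior distribution and target policy are both Gaussians over one action.

def gaussian_score(x, mean, std):
    """Gradient of log N(x; mean, std^2) with respect to x."""
    return (mean - x) / std**2

def guided_score(x, lam, behavior=(0.0, 1.0), target=(2.0, 0.5)):
    """Behavior score plus lam times the target-policy score."""
    return gaussian_score(x, *behavior) + lam * gaussian_score(x, *target)

def guided_mode(lam, behavior=(0.0, 1.0), target=(2.0, 0.5)):
    """Closed-form mode of p_behavior(x) * p_target(x)**lam for Gaussians."""
    (mb, sb), (mt, st) = behavior, target
    prec_b, prec_t = 1.0 / sb**2, lam / st**2
    return (prec_b * mb + prec_t * mt) / (prec_b + prec_t)

def ascend(lam, x=0.0, step=0.05, iters=2000):
    """Follow the guided score by gradient ascent to locate the mode."""
    for _ in range(iters):
        x += step * guided_score(x, lam)
    return x

if __name__ == "__main__":
    for lam in (0.0, 0.5, 1.0, 4.0):
        print(lam, round(ascend(lam), 4), round(guided_mode(lam), 4))
```

In this toy setting, `lam = 0` recovers the behavior mode, and larger values pull the mode toward the target policy's mean without ever passing it, which mirrors the balance between behavior and target likelihood that the abstract describes.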

Paper Structure (36 sections, 15 equations, 6 figures, 4 tables, 2 algorithms)

This paper contains 36 sections, 15 equations, 6 figures, 4 tables, 2 algorithms.

Figures (6)

  • Figure 1: Offline reinforcement learning with policy-guided diffusion. Offline data from a behavior policy is first used to train a trajectory diffusion model. Synthetic experience is then generated with diffusion, guided by the target policy in order to move trajectories further on-policy. An agent is then trained for multiple steps on the synthetic dataset, before it is regenerated.
  • Figure 2: Trajectories from an illustrative 2D environment, in which the start location is indicated by $\bullet$ and the goals for the behavior and target policies are each indicated by a $\mathbf{\times}$ (the two markers are presumably distinguished by color in the original figure). Left: Rollouts from the target policy in the real environment. Right: Offline datasets gathered by the behavior policy suffer from distribution shift and limited sample size. Truncated world models (MOPO, Yu et al., 2020; MOReL, Kidambi et al., 2020) previously used in offline model-based reinforcement learning offer a partial solution to this problem but suffer from bias due to short rollouts. Meanwhile, unguided diffusion (Lu et al., 2023) can increase the sample size, but maintains the original distribution shift. In contrast, policy-guided diffusion samples from a regularized target distribution, generating entire trajectories with low transition error but higher likelihood under the target distribution.
  • Figure 3: Left: Trajectory probability distribution for an example behavior distribution $p_{\text{off}}(\bm{\tau})$ and target policy likelihood $q_{\text{target}}(\bm{\tau})$. Right: Corresponding PGD sampling distribution (\ref{eq:pgd-dist}) computed over a range of policy-guidance coefficients $\lambda$. Increasing $\lambda$ shifts the sampling distribution towards the regions of high target policy likelihood, making PGD an effective mechanism for controlling the level of regularization towards the behavior distribution.
  • Figure 4: Aggregate MuJoCo performance after training on unguided or policy-guided synthetic data under continuous and periodic dataset generation, as well as on the real dataset. For each setting, mean return over TD3+BC and IQL agents is marked, with standard error over 4 seeds (diffusion models and agents) highlighted.
  • Figure 5: Action probability of synthetic trajectories generated by diffusion and PETS models trained on halfcheetah-medium. Target policies are trained on halfcheetah-random, halfcheetah-medium, and halfcheetah-expert datasets, demonstrating robustness to OOD actions. Standard error over 4 diffusion model seeds is shaded (but negligible), with mean computed over 2048 synthetic trajectories.
  • ...and 1 more figure
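
The effect described in the Figure 3 caption can be reproduced numerically with a small sketch. The densities below are hypothetical stand-ins on a one-dimensional grid, not the paper's example, and the sampling distribution is assumed to take the form $p_{\text{off}}(\bm{\tau})\, q_{\text{target}}(\bm{\tau})^{\lambda}$ up to normalization, which is a reading of the caption rather than a quotation of the paper's equation.

```python
import math

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical 1D axis standing in for the space of trajectories tau.
GRID = [i / 100 for i in range(-400, 601)]

def p_off(x):
    # Assumed behavior distribution: a two-mode mixture, mostly near x = 0.
    return 0.7 * normal_pdf(x, 0.0, 1.0) + 0.3 * normal_pdf(x, 3.0, 0.7)

def q_target(x):
    # Assumed target-policy likelihood, peaked near x = 3.
    return normal_pdf(x, 3.0, 1.0)

def pgd_distribution(lam):
    """Normalized p_off(x) * q_target(x)**lam evaluated on GRID."""
    weights = [p_off(x) * q_target(x) ** lam for x in GRID]
    total = sum(weights)
    return [w / total for w in weights]

def mean_of(dist):
    """Expected grid position under a normalized distribution."""
    return sum(x * p for x, p in zip(GRID, dist))

if __name__ == "__main__":
    for lam in (0.0, 0.5, 1.0, 4.0):
        print(lam, round(mean_of(pgd_distribution(lam)), 3))
```

With `lam = 0` the sketch returns the assumed behavior distribution unchanged. As `lam` grows, probability mass moves from the dominant behavior mode toward the region where the assumed target likelihood is high, while regions with no behavior support keep negligible mass.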