Offline Reinforcement Learning with Generative Trajectory Policies

Xinsong Feng, Leshu Tang, Chenan Wang, Haipeng Chen

TL;DR

This paper tackles the expressiveness–efficiency trade-off in generative policies for offline reinforcement learning. It unifies diffusion models, consistency models, Consistency Trajectory Models (CTMs), and flow matching under a continuous-time ODE trajectory framework and introduces Generative Trajectory Policies (GTPs), which learn the full ODE solution map. To make GTPs practical, it adds a stable score-approximation technique and an advantage-weighted objective that blends imitation with value-based policy improvement. Empirically, GTP achieves state-of-the-art results on D4RL benchmarks, including perfect scores on several AntMaze tasks, demonstrating strong expressiveness without prohibitive computation. The result is a principled route to trajectory-based policies for offline RL, with room left for further efficiency gains and broader applicability.

Abstract

Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow, iterative models like diffusion policies are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. In this paper, we demonstrate that it is possible to bridge this gap. The key to moving beyond the limitations of individual methods, we argue, lies in a unifying perspective that views modern generative models, including diffusion, flow matching, and consistency models, as specific instances of learning a continuous-time generative trajectory governed by an Ordinary Differential Equation (ODE). This principled foundation provides a clearer design space for generative policies in RL and allows us to propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline RL, we further introduce two key theoretically principled adaptations. Empirical results demonstrate that GTP achieves state-of-the-art performance on D4RL benchmarks: it significantly outperforms prior generative policies, achieving perfect scores on several notoriously hard AntMaze tasks.
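To make the trade-off between iterative and single-step models concrete, here is a minimal sketch on a toy problem. It is not the paper's implementation. It assumes 1-D Gaussian data with standard deviation `SIGMA_DATA` and the variance-exploding probability-flow ODE, for which the solution map has a closed form. All names (`velocity`, `euler_solve`, `solution_map`) are hypothetical and chosen only for illustration.

```python
import math

SIGMA_DATA = 0.5  # assumed std of a 1-D Gaussian "data" distribution (toy choice)


def velocity(x, t):
    """Toy ODE velocity dx/dt = -t * score(x, t), with p_t = N(0, SIGMA_DATA^2 + t^2)."""
    return t * x / (SIGMA_DATA**2 + t**2)


def euler_solve(x, t, s, n_steps):
    """Iterative sampler: integrate from time t to time s with n_steps Euler steps."""
    h = (s - t) / n_steps
    for k in range(n_steps):
        x = x + h * velocity(x, t + k * h)
    return x


def solution_map(x, t, s):
    """Closed-form solution map of the toy ODE: one evaluation for any pair (t, s)."""
    return x * math.sqrt((SIGMA_DATA**2 + s**2) / (SIGMA_DATA**2 + t**2))


x_T, T = 3.0, 5.0
exact = solution_map(x_T, T, 0.0)

# Error of the iterative solver against the one-shot solution map.
errors = {n: abs(euler_solve(x_T, T, 0.0, n) - exact) for n in (1, 10, 1000)}
for n, err in errors.items():
    print(f"{n:5d} Euler steps -> abs error {err:.6f}")
```

The iterative solver needs many velocity evaluations to approach what the solution map returns in one call. In this toy the map is known analytically; a GTP, as described in the abstract, would instead learn such a map with a network.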

Paper Structure

This paper contains 40 sections, 8 theorems, 73 equations, 6 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

Fix a time horizon $T > 0$, let ${\boldsymbol{x}} \sim p_{\mathrm{data}}$, ${\boldsymbol{z}} \sim \mathcal{N}(0,I)$, and define ${\boldsymbol{x}}_t={\boldsymbol{x}} + t {\boldsymbol{z}}$. Define the vector fields $f^\star,\tilde{f}:\mathbb{R}^d\times(0,T] \to \mathbb{R}^d$ by $f^\star(\ldots)$ … Assume further that for each $u>s$, $\Phi_{\boldsymbol{\theta}}(\cdot,u,s)$ maps $\mathbb{R}^d$ to …
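For orientation, the noising process ${\boldsymbol{x}}_t={\boldsymbol{x}} + t {\boldsymbol{z}}$ alone implies a few standard identities. They are general facts about additive Gaussian noise, given here as background, and are not a statement of the theorem's own definitions of $f^\star$ and $\tilde{f}$. The third line additionally assumes the variance-exploding probability-flow ODE with noise level $\sigma(t)=t$.

$$
\begin{aligned}
p_t({\boldsymbol{x}}_t) &= \int p_{\mathrm{data}}({\boldsymbol{x}})\,\mathcal{N}({\boldsymbol{x}}_t;\,{\boldsymbol{x}},\,t^2 I)\,\mathrm{d}{\boldsymbol{x}},\\
\nabla_{{\boldsymbol{x}}_t}\log p_t({\boldsymbol{x}}_t) &= \frac{\mathbb{E}[{\boldsymbol{x}}\mid{\boldsymbol{x}}_t]-{\boldsymbol{x}}_t}{t^2} = -\frac{\mathbb{E}[{\boldsymbol{z}}\mid{\boldsymbol{x}}_t]}{t},\\
\frac{\mathrm{d}{\boldsymbol{x}}_t}{\mathrm{d}t} &= -t\,\nabla_{{\boldsymbol{x}}_t}\log p_t({\boldsymbol{x}}_t) = \frac{{\boldsymbol{x}}_t-\mathbb{E}[{\boldsymbol{x}}\mid{\boldsymbol{x}}_t]}{t} = \mathbb{E}[{\boldsymbol{z}}\mid{\boldsymbol{x}}_t].
\end{aligned}
$$

Under that assumption, the ODE velocity at ${\boldsymbol{x}}_t$ is the conditional expectation of the noise direction ${\boldsymbol{z}}$ over all pairs $({\boldsymbol{x}},{\boldsymbol{z}})$ that produce the same ${\boldsymbol{x}}_t$.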

Figures (6)

  • Figure 1: The two core techniques of the GTP implementation: (a) Stable Score Approximation: the target trajectory (green) is contrasted with a reference (red) computed by a multi-step ODE solver (red dashed arrow). The blue dashed arrow denotes a single-step update obtained from our approximate score, which yields the blue trajectory without multi-step integration. (b) Value-Driven Guidance: the BC trajectory (green) is shifted toward high-value regions so that the learned GTP trajectory (blue) approaches the optimal action while remaining aligned with the data.
  • Figure 2: Illustration of the learning process. A given point ${\boldsymbol{x}}_t$ can be formed by many different linear paths (corresponding to different pairs of $({\boldsymbol{x}}_0, {\boldsymbol{z}})$). The model is trained to learn a single, deterministic "learned trajectory" that represents the conditional expectation of these paths. This forces the model to learn the true underlying generative dynamics.
  • Figure 3: Comparison of training objectives. (a) Standard Consistency Training supervises predictions back to the origin. (b) GTP extends this principle by enforcing self-consistency across arbitrary intervals, enabling direct learning of the solution map.
  • Figure 4: Illustration of weighted behavior cloning. The empirical dataset distribution (black) is reweighted into a value-guided distribution (red), which emphasizes high-value regions while remaining strictly within the data support.
  • Figure 5: Policy visualization in a 2D multi-goal environment.
  • ...and 1 more figure
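The Figure 3 caption contrasts supervising predictions back to the origin with enforcing self-consistency across arbitrary intervals. The sketch below illustrates that idea on a toy problem and is not the paper's training objective. It assumes 1-D Gaussian data and the variance-exploding probability-flow ODE, and every name in it (`exact_map`, `euler_map`, `consistency_residual`, `sample_times`, `mean_residual`) is hypothetical. The residual compares a direct jump from time t to time s with a two-hop jump through an intermediate time u.

```python
import math
import random

SIGMA_DATA = 1.0  # assumed std of a 1-D Gaussian "data" distribution (toy choice)


def exact_map(x, t, s):
    """Closed-form solution map of the toy ODE dx/dt = t * x / (SIGMA_DATA^2 + t^2)."""
    return x * math.sqrt((SIGMA_DATA**2 + s**2) / (SIGMA_DATA**2 + t**2))


def euler_map(x, t, s):
    """Imperfect candidate map: a single Euler step from t to s."""
    return x + (s - t) * t * x / (SIGMA_DATA**2 + t**2)


def consistency_residual(phi, x_t, t, u, s):
    """Gap between jumping t -> s directly and going t -> u -> s."""
    return abs(phi(x_t, t, s) - phi(phi(x_t, t, u), u, s))


def sample_times(rng, t_max, to_origin):
    """Draw t > u > s. to_origin=True fixes s = 0, the 'back to the origin' case."""
    t = rng.uniform(0.5 * t_max, t_max)
    u = rng.uniform(0.25 * t, 0.75 * t)
    s = 0.0 if to_origin else rng.uniform(0.0, 0.5 * u)
    return t, u, s


def mean_residual(phi, to_origin, n=500, t_max=3.0, seed=0):
    """Average self-consistency residual of phi over noised samples x_t = x_0 + t * z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        t, u, s = sample_times(rng, t_max, to_origin)
        x_t = rng.gauss(0.0, SIGMA_DATA) + t * rng.gauss(0.0, 1.0)
        total += consistency_residual(phi, x_t, t, u, s)
    return total / n


for name, phi in (("exact", exact_map), ("euler", euler_map)):
    print(name, mean_residual(phi, to_origin=True), mean_residual(phi, to_origin=False))
```

The exact map has a residual at floating-point level for every triple of times, while the one-step Euler candidate does not. Self-consistency on its own does not identify the correct map, since any map of the form x · g(s)/g(t) also satisfies it, so a training objective would also need to tie the map to the ODE's local direction.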

Theorems & Definitions (18)

  • Theorem 1
  • proof: Proof Sketch
  • Remark 1: Computational Efficiency
  • Remark 2: Training Stability
  • Theorem 2: Advantage-Weighted Objective
  • Remark 3: Practical Implementation
  • Lemma 1: Conditional unbiasedness
  • proof
  • Lemma 2: One-step local bias
  • proof
  • ...and 8 more
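The list above names an advantage-weighted objective (Theorem 2), and the Figure 4 caption describes reweighting the dataset toward high-value regions while staying within the data support. The statement of Theorem 2 is not reproduced here, so the following is only a generic sketch of exponential advantage weighting in the style of advantage-weighted regression. The function names, the temperature `beta`, and the clipping value `max_weight` are assumptions for illustration.

```python
import math


def advantage_weights(q_values, v_value, beta=1.0, max_weight=20.0):
    """Normalized weights proportional to min(exp((Q_i - V) / beta), max_weight).

    Only actions already in the dataset receive weight, so the reweighted
    distribution cannot leave the data support.
    """
    raw = [min(math.exp((q - v_value) / beta), max_weight) for q in q_values]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_bc_loss(per_sample_losses, weights):
    """Weighted average of per-sample imitation losses."""
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))


# Three dataset actions for one state, with assumed critic values.
q_values = [1.0, 2.0, 3.0]
sharp = advantage_weights(q_values, v_value=2.0, beta=0.5)
flat = advantage_weights(q_values, v_value=2.0, beta=1e6)

print("beta=0.5:", sharp)  # mass concentrates on the highest-value action
print("beta=1e6:", flat)   # close to uniform, i.e. plain behavior cloning
print("loss:", weighted_bc_loss([0.3, 0.2, 0.1], sharp))
```

A small `beta` pushes the objective toward value-based improvement and a large `beta` recovers uniform behavior cloning, which is one simple way to blend imitation with policy improvement.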