BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

Yunpeng Qing, Yixiao Chi, Shuo Chen, Shunyu Liu, Kelu Yao, Sixu Lin, Litao Liu, Changqing Zou

TL;DR

BiTrajDiff tackles offline RL data bias by jointly modeling forward-future and backward-history trajectories from intermediate anchor states using two diffusion models and classifier-free guidance. The method stitches forward and backward paths at shared anchors, then completes trajectories with an inverse dynamics model and a reward model, followed by a two-stage OOD/greedy filtering to ensure data quality. Empirical results on the D4RL suite show BiTrajDiff consistently outperforms forward-only data augmentation baselines across multiple offline RL backbones, especially in sparse-reward tasks and long-horizon planning. The work demonstrates that bidirectional trajectory generation yields richer, more globally connected behavior patterns, improving offline RL performance and robustness.

Abstract

Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enrich data distribution. Despite the promising results, current DA techniques focus solely on reconstructing future trajectories from given states, while ignoring the exploration of history transitions that reach them. This single-direction paradigm inevitably hinders the discovery of diverse behavior patterns, especially those leading to critical states that may have yielded high-reward outcomes. In this work, we introduce Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework for offline RL that models both future and history trajectories from any intermediate state. Specifically, we decompose the trajectory generation task into two independent yet complementary diffusion processes: one generating forward trajectories to predict future dynamics, and the other generating backward trajectories to trace essential history transitions. BiTrajDiff can efficiently leverage critical states as anchors to expand into potentially valuable yet underexplored regions of the state space, thereby facilitating dataset diversity. Extensive experiments on the D4RL benchmark suite demonstrate that BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.
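The stitch-and-complete step described in the TL;DR (joining a backward-history segment and a forward-future segment at a shared anchor state, then labeling the result with an inverse dynamics model and a reward model) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `stitch_at_anchor` and `complete_trajectory`, the toy models, the reward model's `(state, action)` signature, and the output dictionary keys are all assumptions made for illustration, and the diffusion sampling and OOD/greedy filtering stages are omitted.

```python
import numpy as np

def stitch_at_anchor(backward_states, forward_states, atol=1e-6):
    """Join a history segment that ends at the anchor with a future
    segment that starts at the same anchor (hypothetical helper)."""
    backward_states = np.asarray(backward_states, dtype=float)
    forward_states = np.asarray(forward_states, dtype=float)
    if not np.allclose(backward_states[-1], forward_states[0], atol=atol):
        raise ValueError("segments do not share the same anchor state")
    # The anchor appears in both segments, so drop it from the forward one.
    return np.concatenate([backward_states, forward_states[1:]], axis=0)

def complete_trajectory(states, inverse_dynamics, reward_model):
    """Label each consecutive state pair with an action and a reward.
    Both models are passed in as plain callables (assumed interfaces)."""
    s, s_next = states[:-1], states[1:]
    actions = inverse_dynamics(s, s_next)
    rewards = reward_model(s, actions)
    return {
        "observations": s,
        "actions": actions,
        "rewards": rewards,
        "next_observations": s_next,
    }

# Toy stand-ins for the learned models (assumptions, not from the paper).
def toy_inverse_dynamics(s, s_next):
    return s_next - s  # pretend the action is the state displacement

def toy_reward_model(s, a):
    return -np.linalg.norm(a, axis=-1)  # pretend reward penalizes step size

history = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])  # ends at anchor [2, 0]
future = np.array([[2.0, 0.0], [2.0, 1.0], [2.0, 2.0]])   # starts at anchor [2, 0]

states = stitch_at_anchor(history, future)
traj = complete_trajectory(states, toy_inverse_dynamics, toy_reward_model)
```

In this toy setup the stitched state sequence has five states and therefore four transitions. In the actual method, both segments would come from the two diffusion models conditioned on the anchor, and the completed transitions would still need to pass the filtering stage before being added to the dataset.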

Paper Structure

This paper contains 28 sections, 6 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: An illustrative diagram of our Bidirectional Trajectory Diffusion (BiTrajDiff) method.
  • Figure 2: Performance improvement comparison of offline RL algorithms augmented with single- and bi-directional diffusion trajectories. The task abbreviations are listed in Table \ref{tab::abstart-gym}.
  • Figure 3: Learning curves of BiTrajDiff with data from different variants related to trajectory filters. Detailed results are reported in Appendix \ref{supp::exp-traj-filter}.
  • Figure 4: Comparison of BiTrajDiff returns under different augmented data ratios $\sigma$ in the walker2d-medium-replay task. Detailed results are reported in Appendix \ref{supp::exp-data-ratio}.
  • Figure 5: Comparison of test returns between other DA baselines and our BiTrajDiff under varying $n$-step TD estimators. Detailed results are reported in Appendix \ref{supp::exp-n-step-td-estimator}.
  • ...and 2 more figures