Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens

Joseph Clinton, Robert Lieck

TL;DR

The paper addresses offline reinforcement learning on long-horizon tasks, where autoregressive models suffer from compounding errors. It introduces the Planning Transformer, which augments the Decision Transformer with Planning Tokens: high-level, long-horizon tokens predicted at regular intervals to guide the low-level policy. Using Plan sampling, goal conditioning, and a unified training and inference pipeline, the approach achieves state-of-the-art or competitive results on challenging D4RL benchmarks, and its plan visualizations and attention maps make the policy easier to interpret. The results indicate that explicit dual-timescale planning signals improve trajectory stitching and long-horizon reasoning within a single model, with possible extensions to online learning and other domains.

Abstract

Supervised learning approaches to offline reinforcement learning, particularly those utilizing the Decision Transformer, have shown effectiveness in continuous environments and for sparse rewards. However, they often struggle with long-horizon tasks due to the high compounding error of auto-regressive models. To overcome this limitation, we go beyond next-token prediction and introduce Planning Tokens, which contain high-level, long time-scale information about the agent's future. Predicting dual time-scale tokens at regular intervals enables our model to use these long-horizon Planning Tokens as a form of implicit planning to guide its low-level policy and reduce compounding error. This architectural modification significantly enhances performance on long-horizon tasks, establishing a new state-of-the-art in complex D4RL environments. Additionally, we demonstrate that Planning Tokens improve the interpretability of the model's policy through plan visualisations and attention maps.
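
As a rough illustration of what "long time-scale information about the agent's future" could look like, the sketch below builds a plan by taking a few evenly spaced future states from a logged trajectory. The helper name `sample_plan`, the even spacing, and the use of raw states as plan tokens are assumptions made for this example; the abstract does not specify how Planning Tokens are constructed.

```python
from typing import List, Tuple

State = Tuple[float, ...]


def sample_plan(trajectory: List[State], start: int, num_tokens: int) -> List[State]:
    """Pick up to `num_tokens` evenly spaced future states after index `start`.

    Hypothetical helper: the last selected state is always the final state of
    the trajectory, so the plan spans the whole remaining horizon.
    """
    remaining = len(trajectory) - 1 - start
    if remaining <= 0 or num_tokens <= 0:
        return []
    # Never ask for more plan tokens than there are future timesteps.
    num_tokens = min(num_tokens, remaining)
    # Evenly spaced offsets in (0, remaining], rounded to integer timesteps.
    offsets = [round(remaining * (k + 1) / num_tokens) for k in range(num_tokens)]
    return [trajectory[start + off] for off in offsets]


# Toy 1-D trajectory: the state at timestep t is (t,).
trajectory = [(float(t),) for t in range(21)]
plan = sample_plan(trajectory, start=0, num_tokens=4)
print(plan)
```

In this toy setup the plan is far sparser than the trajectory itself, which is the sense in which such tokens would operate on a longer time-scale than per-step states and actions.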

Paper Structure (31 sections, 4 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Planning Transformer architecture diagram showing the input sequence construction and the two output heads. RSA memory denotes the agent's memory of the most recent returns-to-go (RTG), states, and actions.
  • Figure 2: Model inference pipeline. In Stage 1, the Planning Transformer's planning head generates $\rho$ planning tokens; in Stage 2, the model switches to the action head to generate the agent's actions conditioned on the Plan.
  • Figure 3: Normalized score of the model on a selection of environments with and without Plans, demonstrating the efficacy of using Plans.
  • Figure 4: Visualisation of Plans on the antmaze-large environment. The rainbow-coloured line is the actual path taken by the ant, the star is the goal, and the solid coloured lines with key-points (beginning at the red dots) are the Plans. On the left is a success case where the agent reached the goal, and on the right is a failure case where the model got stuck. In both cases the Plans provide interpretability.
  • Figure 5: A snapshot of the attention map taken halfway through a run on the AntMaze medium-diverse environment.
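
The two-stage inference described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `two_stage_rollout`, the toy heads, the 1-D environment, and the context layout `[goal, s0, plan tokens, a, s, a, s, ...]` are all hypothetical stand-ins for the trained transformer and a D4RL environment. It performs a single planning pass; re-planning at regular intervals would wrap this in an outer loop.

```python
from typing import Callable, List, Tuple

Head = Callable[[List[float]], float]


def two_stage_rollout(
    planning_head: Head,
    action_head: Head,
    env_step: Callable[[float, float], float],
    initial_state: float,
    goal: float,
    num_plan_tokens: int,
    num_steps: int,
) -> Tuple[List[float], List[float]]:
    """Generate a plan first, then act conditioned on it."""
    context = [goal, initial_state]

    # Stage 1: autoregressively generate the planning tokens.
    plan = []
    for _ in range(num_plan_tokens):
        token = planning_head(context)
        plan.append(token)
        context.append(token)

    # Stage 2: switch to the action head; the plan stays in the context.
    state = initial_state
    states = [state]
    for _ in range(num_steps):
        action = action_head(context)
        state = env_step(state, action)
        context.extend([action, state])
        states.append(state)
    return plan, states


RHO = 3  # number of planning tokens in this toy example


def toy_planning_head(context: List[float]) -> float:
    # Next waypoint: halfway between the latest token and the goal (context[0]).
    return (context[-1] + context[0]) / 2.0


def toy_action_head(context: List[float]) -> float:
    plan = context[2:2 + RHO]
    # Before any action is taken, the current state is the initial state.
    state = context[-1] if len(context) > 2 + RHO else context[1]
    # Head for the first waypoint still ahead of the agent, else the goal.
    target = next((p for p in plan if p > state), context[0])
    return max(-1.0, min(1.0, target - state))


def toy_env_step(state: float, action: float) -> float:
    return state + action


plan, states = two_stage_rollout(
    toy_planning_head, toy_action_head, toy_env_step,
    initial_state=0.0, goal=8.0, num_plan_tokens=RHO, num_steps=8,
)
print(plan, states[-1])
```

Keeping both heads on one shared context mirrors the caption's description of a single model that switches output heads, rather than two separate planner and policy networks.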