Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens
Joseph Clinton, Robert Lieck
TL;DR
The paper tackles long-horizon tasks in offline reinforcement learning by addressing the compounding errors of autoregressive models. It introduces the Planning Transformer, which augments the Decision Transformer with Planning Tokens: tokens that encode high-level, long-horizon information and are predicted at regular intervals to guide the low-level policy. Using plan sampling, goal conditioning, and a unified training and inference pipeline, the approach achieves state-of-the-art or competitive results on challenging D4RL benchmarks, and its plan visualizations and attention maps make the policy easier to interpret. The work shows that explicit, dual-timescale planning signals can improve trajectory stitching and long-horizon reasoning within a single, simple model, with potential extensions to online learning and other domains.
Abstract
Supervised learning approaches to offline reinforcement learning, particularly those utilizing the Decision Transformer, have shown effectiveness in continuous environments and for sparse rewards. However, they often struggle with long-horizon tasks due to the high compounding error of autoregressive models. To overcome this limitation, we go beyond next-token prediction and introduce Planning Tokens, which contain high-level, long time-scale information about the agent's future. Predicting dual time-scale tokens at regular intervals enables our model to use these long-horizon Planning Tokens as a form of implicit planning to guide its low-level policy and reduce compounding error. This architectural modification significantly enhances performance on long-horizon tasks, establishing a new state-of-the-art in complex D4RL environments. Additionally, we demonstrate that Planning Tokens improve the interpretability of the model's policy through plan visualisations and attention maps.
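
To make the dual time-scale idea concrete, the following is a minimal sketch of how long-horizon tokens might be derived from an offline trajectory and placed ahead of the low-level state and action tokens. The abstract does not specify how plans are constructed, so everything here is an assumption made for illustration: the helper names `extract_plan` and `build_sequence`, the choice of evenly spaced future states as plan tokens, and the token layout are all hypothetical and should not be read as the authors' implementation.

```python
from typing import List, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]
Token = Tuple[str, Tuple[float, ...]]


def extract_plan(states: Sequence[State], start: int, num_tokens: int) -> List[State]:
    """Pick `num_tokens` future states, evenly spaced from `start` to the end.

    Assumption: a plan is a coarse subsample of the remaining trajectory.
    """
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    last = len(states) - 1
    if not 0 <= start <= last:
        raise ValueError("start is out of range")
    remaining = last - start
    # Integer arithmetic keeps indices in range; the final plan token is
    # always the last state of the trajectory.
    indices = [start + (remaining * (k + 1)) // num_tokens for k in range(num_tokens)]
    return [states[i] for i in indices]


def build_sequence(
    plan: Sequence[State],
    states: Sequence[State],
    actions: Sequence[Action],
    start: int,
    context_len: int,
) -> List[Token]:
    """Place the coarse plan tokens before the fine-grained state/action tokens."""
    tokens: List[Token] = [("plan", s) for s in plan]
    for t in range(start, min(start + context_len, len(actions))):
        tokens.append(("state", states[t]))
        tokens.append(("action", actions[t]))
    return tokens
```

With a 21-step trajectory and four plan tokens, for example, the plan would consist of the states at steps 5, 10, 15, and 20, while the low-level tokens continue to advance one step at a time. A sequence model trained on such inputs can attend to the coarse tokens when predicting each action, which is one way to read the abstract's description of implicit planning.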
