IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning

Yihao Qin, Yuanfei Wang, Hang Zhou, Peiran Liu, Hao Dong, Yiding Ji

TL;DR

Imaginary Planning Distillation (IPD) is a framework that incorporates offline planning into data generation, supervised training, and online inference. On the D4RL benchmark it significantly outperforms several state-of-the-art value-based and transformer-based offline RL methods across diverse tasks.

Abstract

Decision Transformer-based sequential policies have emerged as a powerful paradigm in offline reinforcement learning (RL), yet their efficacy remains constrained by the quality of static datasets and by inherent architectural limitations. Specifically, these models often struggle to integrate suboptimal experiences effectively and do not explicitly plan for an optimal policy. To bridge this gap, we propose Imaginary Planning Distillation (IPD), a novel framework that incorporates offline planning into data generation, supervised training, and online inference. Our framework first learns a world model equipped with uncertainty measures and a quasi-optimal value function from the offline data. These components are used to identify suboptimal trajectories and augment them with reliable, imagined optimal rollouts generated via Model Predictive Control (MPC). A Transformer-based sequential policy is then trained on this enriched dataset, complemented by a value-guided objective that promotes the distillation of the optimal policy. By replacing the conventional, manually tuned return-to-go with the learned quasi-optimal value function, IPD improves both decision-making stability and performance during inference. Empirical evaluations on the D4RL benchmark demonstrate that IPD significantly outperforms several state-of-the-art value-based and transformer-based offline RL methods across diverse tasks.
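
The difference between a manually tuned return-to-go and a value-based conditioning signal can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy value function, and the state representation are all hypothetical, and the only idea taken from the abstract is that the conditioning token comes from a learned value function rather than from a hand-picked target return.

```python
from typing import Callable, List, Sequence

State = Sequence[float]


def fixed_rtg_schedule(target_return: float, rewards: List[float]) -> List[float]:
    """Conventional conditioning: start from a hand-picked target return
    and subtract each observed reward to get the next return-to-go."""
    rtgs = [target_return]
    for r in rewards:
        rtgs.append(rtgs[-1] - r)
    return rtgs


def value_rtg_schedule(value_fn: Callable[[State], float],
                       states: List[State]) -> List[float]:
    """Value-based conditioning (assumed form): query a learned value
    function at every visited state, so no target return is tuned by hand."""
    return [value_fn(s) for s in states]


if __name__ == "__main__":
    # Hypothetical stand-in for a learned quasi-optimal value function.
    toy_value = lambda s: float(sum(s))
    print(fixed_rtg_schedule(10.0, [1.0, 2.0]))
    print(value_rtg_schedule(toy_value, [[1.0, 2.0], [3.0]]))
```

In the fixed schedule the whole episode depends on the initial `target_return`, whereas the value-based schedule depends only on the current state, which is one plausible reading of why the abstract reports more stable inference.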

Paper Structure

This paper contains 17 sections, 26 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: An overview of IPD. The process begins by learning a world model with an uncertainty measure and a quasi-optimal value function from the original suboptimal offline dataset. Suboptimal states are identified using the value function, and their corresponding trajectories are replaced with imaginary rollouts generated via Model Predictive Control, using the learned world model and value function. Each generated trajectory is evaluated for uncertainty before being incorporated into the enhanced dataset. Finally, a Transformer-based sequential policy is trained on this MPC-enhanced dataset, with additional supervision from the value function via action gradients and a dynamic return-to-go. By distilling the imaginary planning, which streamlines both MPC and dynamic programming, into the Transformer-based policy, IPD enables the generation of superior actions.
  • Figure 2: Comparison between MPC and greedy Q-learning data augmentation on the Walker2d-medium-replay task.
  • Figure 3: Performance comparison between the quasi-optimal value function and different return-to-go settings on the halfcheetah-medium-replay task.
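
The data-selection steps described in the Figure 1 caption (flag suboptimal states, then keep only low-uncertainty imagined rollouts) can be sketched as follows. The source does not give the exact criteria, so everything here is an assumption made for illustration: uncertainty is taken to be ensemble disagreement, a state is flagged when its dataset return-to-go falls short of the value estimate by more than a margin `gap`, and a rollout is accepted only if every step stays under a `threshold`. All function and parameter names are hypothetical.

```python
from statistics import pstdev
from typing import List, Sequence


def ensemble_uncertainty(predictions: List[Sequence[float]]) -> float:
    """Assumed uncertainty measure: the largest per-dimension population
    standard deviation across ensemble members' next-state predictions."""
    return max(pstdev(dim) for dim in zip(*predictions))


def is_suboptimal(return_to_go: float, value_estimate: float, gap: float) -> bool:
    """Assumed criterion: the value function promises more than the dataset
    trajectory actually achieved from this state, by more than `gap`."""
    return value_estimate - return_to_go > gap


def accept_rollout(step_uncertainties: List[float], threshold: float) -> bool:
    """Assumed filter: keep an imagined rollout only if every step's
    uncertainty is at or below the threshold."""
    return all(u <= threshold for u in step_uncertainties)


if __name__ == "__main__":
    # Two hypothetical ensemble members predicting a 2-D next state.
    u = ensemble_uncertainty([[0.0, 1.0], [0.0, 3.0]])
    print(u, is_suboptimal(5.0, 8.0, 2.0), accept_rollout([0.1, u], 0.5))
```

A per-step maximum is a conservative choice; an average or cumulative uncertainty budget would be an equally reasonable reading of the caption.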