DODT: Enhanced Online Decision Transformer Learning through Dreamer's Actor-Critic Trajectory Forecasting

Eric Hanchen Jiang, Zhi Zhang, Dinghuai Zhang, Andrew Lizarraga, Chenheng Xu, Yasi Zhang, Siyan Zhao, Zhengjie Xu, Peiyu Yu, Yuer Tang, Deqian Kong, Ying Nian Wu

TL;DR

A novel approach that combines the Dreamer algorithm's ability to generate anticipatory trajectories with the adaptive learning strengths of the Online Decision Transformer to create a bidirectional enhancement loop that accelerates learning and showcases robustness in diverse and dynamic scenarios.

Abstract

Advancements in reinforcement learning have led to the development of sophisticated models capable of learning complex decision-making tasks. However, efficiently integrating world models with decision transformers remains a challenge. In this paper, we introduce a novel approach that combines the Dreamer algorithm's ability to generate anticipatory trajectories with the adaptive learning strengths of the Online Decision Transformer. Our methodology enables parallel training where Dreamer-produced trajectories enhance the contextual decision-making of the transformer, creating a bidirectional enhancement loop. We empirically demonstrate the efficacy of our approach on a suite of challenging benchmarks, achieving notable improvements in sample efficiency and reward maximization over existing methods. Our results indicate that the proposed integrated framework not only accelerates learning but also showcases robustness in diverse and dynamic scenarios, marking a significant step forward in model-based reinforcement learning.

Paper Structure

This paper contains 8 sections, 3 equations, 4 figures, 1 table, 3 algorithms.

Figures (4)

  • Figure 1: This figure illustrates how Dreamer's world model infers latent states from environment observations and stores the resulting trajectories in the replay buffer. The Decision Transformer, fine-tuned on a GPT-2 model, then learns from these trajectories to determine the agent's actions.
  • Figure 2: This diagram illustrates the combined training approach of the Online Decision Transformer (left) and the Dreamer model (right). The Online Decision Transformer refines its decision-making using historical data and ongoing interactions stored in the replay buffer. Concurrently, the Dreamer model projects future trajectories, adding simulated experiences to the replay buffer that enhance the predictive capabilities of the system. This integrated framework allows for dynamic adaptation and improved decision-making in complex environments.
  • Figure 3: This figure shows the convergence graphs of Online Decision Transformer (ODT) and Dreamer Online Decision Transformer (DODT) in various MuJoCo environments. The graphs depict normalized returns over steps for different environments: (a) Hopper-v2 to (h) Ant-v2 Replay.
  • Figure 4: This figure shows the number of Dreamer-generated trajectories that benefited the Online Decision Transformer by helping it achieve higher rewards, for environments (a) Hopper-v2 to (h) Ant-v2 Replay.
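
The shared replay buffer described in the captions for Figures 2 and 4 can be pictured with a minimal Python sketch. Everything here is an illustrative assumption rather than the authors' implementation: the names `Trajectory` and `SharedReplayBuffer`, the `"env"` and `"dreamer"` source tags, and in particular the rule of evicting the lowest-return trajectory when the buffer is full. The sketch only shows how real and imagined trajectories could share one buffer, and how one might count the imagined trajectories that end up being retained.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # identity-based equality, so removal targets one exact object
class Trajectory:
    rewards: List[float]
    source: str  # hypothetical tag: "env" for real rollouts, "dreamer" for imagined ones

    @property
    def ret(self) -> float:
        """Undiscounted return of the trajectory."""
        return sum(self.rewards)


@dataclass
class SharedReplayBuffer:
    capacity: int
    items: List[Trajectory] = field(default_factory=list)

    def add(self, traj: Trajectory) -> bool:
        """Insert a trajectory; return True if it is still in the buffer afterwards.

        Assumed policy (not taken from the paper): when the buffer exceeds its
        capacity, the trajectory with the lowest return is evicted.
        """
        self.items.append(traj)
        if len(self.items) > self.capacity:
            worst = min(self.items, key=lambda t: t.ret)
            self.items.remove(worst)
            return worst is not traj
        return True

    def count(self, source: str) -> int:
        """Number of stored trajectories that came from the given source."""
        return sum(1 for t in self.items if t.source == source)


buf = SharedReplayBuffer(capacity=2)
buf.add(Trajectory([1.0, 1.0], "env"))             # return 2.0
buf.add(Trajectory([0.5], "env"))                  # return 0.5
kept = buf.add(Trajectory([2.0, 1.5], "dreamer"))  # return 3.5, displaces the 0.5 rollout
rejected = buf.add(Trajectory([0.1], "dreamer"))   # return 0.1, evicted immediately
```

Under this assumed policy, `buf.count("dreamer")` gives a rough analogue of the quantity plotted in Figure 4: how many imagined trajectories were good enough to stay in the buffer the transformer trains on.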