Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement

Zhi Wang, Li Zhang, Wenhao Wu, Yuanheng Zhu, Dongbin Zhao, Chunlin Chen

TL;DR

Experimental results on MuJoCo and Meta-World benchmarks across various dataset types show that Meta-DT exhibits superior few-shot and zero-shot generalization capacity compared to strong baselines while being more practical, with fewer prerequisites.

Abstract

A longstanding goal of artificial general intelligence is to build highly capable generalists that can learn from diverse experiences and generalize to unseen tasks. The language and vision communities have made remarkable progress toward this goal by scaling up transformer-based models trained on massive datasets, while reinforcement learning (RL) agents still suffer from poor generalization capacity under such paradigms. To tackle this challenge, we propose Meta Decision Transformer (Meta-DT), which leverages the sequential modeling ability of the transformer architecture and robust task representation learning via world model disentanglement to achieve efficient generalization in offline meta-RL. We pretrain a context-aware world model to learn a compact task representation, and inject it as a contextual condition into the causal transformer to guide task-oriented sequence generation. We then use history trajectories generated by the meta-policy as a self-guided prompt to exploit the architectural inductive bias. To construct the prompt, we select the trajectory segment that yields the largest prediction error on the pretrained world model, aiming to maximally encode task-specific information that is complementary to the world model. Notably, the proposed framework eliminates the requirement of any expert demonstration or domain knowledge at test time. Experimental results on MuJoCo and Meta-World benchmarks across various dataset types show that Meta-DT exhibits superior few-shot and zero-shot generalization capacity compared to strong baselines while being more practical, with fewer prerequisites. Our code is available at https://github.com/NJU-RL/Meta-DT.
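
The prompt-construction step described in the abstract (pick the trajectory segment on which the pretrained world model errs the most) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_prompt_segment`, the `predict_fn` interface, the prompt length `k`, and the choice of summed squared reward and next-state error as the error measure are all assumptions made for illustration.

```python
import numpy as np

def select_prompt_segment(states, actions, rewards, next_states, predict_fn, k):
    """Return (start, end) of the length-k window with the largest world-model error.

    predict_fn is a hypothetical stand-in for the pretrained world model: it maps
    (states, actions) to (predicted_rewards, predicted_next_states).
    """
    pred_r, pred_s = predict_fn(states, actions)
    # Assumed per-step error: squared reward error plus squared next-state error.
    err = (pred_r - rewards) ** 2 + np.sum((pred_s - next_states) ** 2, axis=-1)
    if k > len(err):
        raise ValueError("prompt length k exceeds trajectory length")
    # Sliding-window sums through a cumulative sum: window[i] = sum(err[i:i+k]).
    csum = np.concatenate(([0.0], np.cumsum(err)))
    window = csum[k:] - csum[:-k]
    start = int(np.argmax(window))
    return start, start + k

# Toy usage: a "world model" that always predicts zero reward and an unchanged state.
T = 10
states = np.zeros((T, 2))
actions = np.zeros((T, 1))
next_states = states.copy()
rewards = np.zeros(T)
rewards[5:8] = 1.0  # the toy model is wrong only on steps 5, 6, 7

toy_model = lambda s, a: (np.zeros(len(s)), s)
segment = select_prompt_segment(states, actions, rewards, next_states, toy_model, k=3)
```

In this toy case the selected window covers exactly the steps where the stand-in model mispredicts the reward, which is the intuition behind choosing a prompt that is complementary to what the world model already captures.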

Paper Structure

This paper contains 15 sections, 5 equations, 13 figures, 13 tables, 5 algorithms.

Figures (13)

  • Figure 1: Overview of Meta-DT. We pretrain a context-aware world model to accurately disentangle task-specific information. It contains a context encoder $E_{\psi}$ that abstracts the recent $h$-step history $\mu_{t}^i$ into a compact task representation $z_t^i$, and generalized decoders ($R_{\phi}, T_{\varphi}$) that predict the reward and next state conditioned on $z_t^i$. The inferred task representation is then injected as a contextual condition into the causal transformer to guide task-oriented sequence generation. Finally, we design a self-guided prompt from history trajectories generated by the meta-policy at test time. We select the trajectory segment that yields the largest prediction error on the pretrained world model, aiming to maximally encode task-relevant information that is complementary to the world model.
  • Figure 2: The received return curves averaged over test tasks of Meta-DT and baselines using Medium datasets under an aligned few-shot setting.
  • Figure 3: The received return curves averaged over test tasks of Meta-DT and baselines using Medium datasets under an aligned zero-shot setting.
  • Figure 4: Test return curves of Meta-DT ablations using Medium datasets. w/o_context removes task representation, w/o_com removes the complementary way, and w/o_prompt removes the prompt.
  • Figure 5: Few-shot test curves of Meta-DT and baselines using Mixed datasets.
  • ...and 8 more figures
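
The data flow in the Figure 1 caption (history to task representation, then task-conditioned reward and next-state prediction) can be sketched as below. This is only an illustration under stated assumptions: the linear maps `W_enc`, `W_rew`, and `W_dyn` are hypothetical stand-ins for $E_{\psi}$, $R_{\phi}$, and $T_{\varphi}$, the mean-pooled encoder is a placeholder for whatever network the paper uses, each history step is assumed to be a $(s, a, r, s')$ tuple, and the squared-error loss is an assumed objective.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, z_dim, h = 3, 2, 4, 5
trans_dim = state_dim + action_dim + 1 + state_dim  # one (s, a, r, s') step, flattened

# Hypothetical linear parameters standing in for E_psi, R_phi, T_varphi.
W_enc = rng.normal(size=(trans_dim, z_dim))
W_rew = rng.normal(size=(state_dim + action_dim + z_dim,))
W_dyn = rng.normal(size=(state_dim + action_dim + z_dim, state_dim))

def encode(history):
    """Context encoder stand-in: (h, trans_dim) history -> (z_dim,) task representation."""
    return np.tanh(history @ W_enc).mean(axis=0)

def decode(state, action, z):
    """Decoder stand-ins: predict reward and next state conditioned on z."""
    x = np.concatenate([state, action, z])
    return float(x @ W_rew), x @ W_dyn

def world_model_loss(history, state, action, reward, next_state):
    """Assumed objective: squared reward error plus squared next-state error."""
    z = encode(history)
    r_hat, s_hat = decode(state, action, z)
    return (r_hat - reward) ** 2 + float(np.sum((s_hat - next_state) ** 2))

# Toy usage with random data, only to show the shapes involved.
history = rng.normal(size=(h, trans_dim))
state, action = rng.normal(size=state_dim), rng.normal(size=action_dim)
z = encode(history)
r_hat, s_hat = decode(state, action, z)
loss = world_model_loss(history, state, action, 0.0, np.zeros(state_dim))
```

The point of the sketch is the conditioning structure: the decoders never see the task identity directly, only the representation inferred from recent history, so prediction quality depends on how much task information that representation carries.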