Prompting Decision Transformer for Few-Shot Policy Generalization
Mengdi Xu, Yikang Shen, Shun Zhang, Yuchen Lu, Ding Zhao, Joshua B. Tenenbaum, Chuang Gan
TL;DR
The paper addresses offline RL generalization to unseen tasks by leveraging a prompt-based inductive bias. It introduces Prompt-based Decision Transformer (Prompt-DT), which uses short trajectory prompts of length $K^*$ to condition a Transformer-based policy and achieve few-shot generalization without finetuning. In five MuJoCo benchmarks, Prompt-DT outperforms its variants and a strong offline meta-RL baseline, MACAW, and remains robust to prompt length while enabling extrapolation to out-of-distribution tasks. The results suggest that trajectory prompts can encode task information effectively and motivate broader use of prompt-based architectures for data-efficient RL.
Abstract
Humans can leverage prior experience and learn novel tasks from a handful of demonstrations. In contrast to offline meta-reinforcement learning, which aims to achieve quick adaptation through better algorithm design, we investigate the effect of architecture inductive bias on the few-shot learning capability. We propose a Prompt-based Decision Transformer (Prompt-DT), which leverages the sequential modeling ability of the Transformer architecture and the prompt framework to achieve few-shot adaptation in offline RL. We design the trajectory prompt, which contains segments of the few-shot demonstrations, and encodes task-specific information to guide policy generation. Our experiments in five MuJoCo control benchmarks show that Prompt-DT is a strong few-shot learner without any extra finetuning on unseen target tasks. Prompt-DT outperforms its variants and strong meta offline RL baselines by a large margin with a trajectory prompt containing only a few timesteps. Prompt-DT is also robust to prompt length changes and can generalize to out-of-distribution (OOD) environments.
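To make the idea of a trajectory prompt concrete, the following is a minimal sketch of how a short segment of a few-shot demonstration might be cut out and prepended to the agent's recent history before being fed to a sequence model. It is an illustration under stated assumptions, not the authors' implementation: the (return-to-go, state, action) layout is borrowed from the Decision Transformer convention, and the names `returns_to_go`, `sample_prompt`, and `build_input` as well as the dictionary format are hypothetical.

```python
import numpy as np


def returns_to_go(rewards):
    """Suffix sums of the rewards: rtg[t] = rewards[t] + rewards[t+1] + ..."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards[::-1])[::-1]


def sample_prompt(demo, prompt_len, rng):
    """Cut a random contiguous segment of `prompt_len` steps from one demonstration.

    `demo` is assumed to be a dict with arrays "rewards" (T,), "states" (T, ds),
    and "actions" (T, da). This format is an assumption made for the sketch.
    """
    horizon = len(demo["rewards"])
    if prompt_len > horizon:
        raise ValueError("prompt_len exceeds demonstration length")
    # The upper bound is exclusive, so start ranges over [0, horizon - prompt_len].
    start = int(rng.integers(0, horizon - prompt_len + 1))
    window = slice(start, start + prompt_len)
    rtg = returns_to_go(demo["rewards"])
    return {
        "rtg": rtg[window],
        "states": demo["states"][window],
        "actions": demo["actions"][window],
    }


def build_input(prompt, history):
    """Prepend the prompt segment to the recent history along the time axis."""
    keys = ("rtg", "states", "actions")
    return {k: np.concatenate([prompt[k], history[k]], axis=0) for k in keys}
```

With a prompt of $K^*$ steps and a history of $K$ steps, the combined input covers $K^* + K$ timesteps. Because the task information enters only through the prepended segment, switching to a new task in this sketch means swapping the demonstration passed to `sample_prompt`, with no change to model weights.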
