A Minimalist Prompt for Zero-Shot Policy Learning

Meng Song, Xuezhi Wang, Tanay Biradar, Yao Qin, Manmohan Chandraker

TL;DR

This work investigates zero-shot generalization in policy learning by asking what minimal prompt suffices to match or exceed demonstration-driven generalization. The authors show that conditioning a Decision Transformer (DT) on task parameters $\mathbf{c}$ alone yields zero-shot generalization on par with or better than demonstration-based prompting, which suggests that DTs try to recover the task parameters from a demonstration prompt. They further introduce a learnable prompt $\mathbf{z}$; together, $\mathbf{c}$ and $\mathbf{z}$ form a minimalist prompt, where $\mathbf{z}$ extracts the remaining generalizable information from the supervision to boost zero-shot performance. Across robotic control, manipulation, and navigation benchmarks, the minimalist-prompting DT often outperforms baselines and generalizes robustly across multiple skills. These findings point to a practical way of deploying generalizable robotic policies without task-specific demonstrations: specify the task through its parameters and let a learned prompt carry the rest.

Abstract

Transformer-based methods have exhibited significant generalization ability when prompted with target-domain demonstrations or example solutions during inference. Although demonstrations, as a way of task specification, can capture rich information that may be hard to specify by language, it remains unclear what information is extracted from the demonstrations to help generalization. Moreover, assuming access to demonstrations of an unseen task is impractical or unreasonable in many real-world scenarios, especially in robotics applications. These questions motivate us to explore what the minimally sufficient prompt could be to elicit the same level of generalization ability as the demonstrations. We study this problem in the contextual RL setting, which allows for quantitative measurement of generalization and is commonly adopted by meta-RL and multi-task RL benchmarks. In this setting, the training and test Markov Decision Processes (MDPs) differ only in certain properties, which we refer to as task parameters. We show that conditioning a decision transformer (DT) on these task parameters alone can enable zero-shot generalization on par with or better than its demonstration-conditioned counterpart. This suggests that task parameters are essential for generalization and that DT models are trying to recover them from the demonstration prompt. To extract the remaining generalizable information from the supervision, we introduce an additional learnable prompt, which is demonstrated to further boost zero-shot generalization across a range of robotic control, manipulation, and navigation benchmark tasks.
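
To make the contextual RL setting concrete, the following is a minimal sketch, not the authors' code. It assumes a hypothetical benchmark in which each task is identified by a single scalar task parameter (for example, a target velocity) and splits the tasks into disjoint training and test sets. The names `split_tasks` and `task_params`, and the parameter range, are illustrative assumptions.

```python
import numpy as np

def split_tasks(task_params, n_test, seed=0):
    """Split task parameter vectors into disjoint train and test sets.

    task_params: array of shape (num_tasks, param_dim), one row per task.
    Test tasks are never seen during training, so evaluating on them
    measures zero-shot generalization.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(task_params))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return task_params[train_idx], task_params[test_idx]

# Hypothetical task family: 10 tasks that differ only in a target velocity.
task_params = np.linspace(0.0, 3.0, 10).reshape(-1, 1)
train_c, test_c = split_tasks(task_params, n_test=3)

print(train_c.shape, test_c.shape)
```

At test time, a policy conditioned on task parameters would receive only a row of `test_c` as its task specification, with no demonstration of the unseen task.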

Paper Structure (33 sections, 5 equations, 3 figures, 15 tables, 2 algorithms)

This paper contains 33 sections, 5 equations, 3 figures, 15 tables, 2 algorithms.

Figures (3)

  • Figure 1: Architecture of the minimalist prompting decision transformer. At each time step $t$, the model receives the most recent $K$ time steps of the input trajectory, prepended by a minimalist prompt (gray and green parts), and outputs a sequence of actions up to $\hat{\mathbf{a}}_t$. A minimalist prompt consists of a task parameter vector $[\mathbf{c}, \mathbf{c}, \mathbf{c}]$ and a learnable prompt $\mathbf{z}=[\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3]$.
  • Figure 2: Comparison of the full minimalist-prompting DT (Task-Learned-DT) with four baselines: Task-DT, Trajectory-DT, Pure-Learned-DT, and DT. The zero-shot performance of each algorithm is evaluated throughout the entire training process on five benchmark problems and reported as the mean normalized score. Shaded regions show one standard deviation across three seeds.
  • Figure 3: Comparison of the full minimalist-prompting DT (Task-Learned-DT) with four baselines (Task-DT, Trajectory-DT, Pure-Learned-DT, and DT) on training tasks. The seen-task performance of each algorithm is evaluated throughout the entire training process on five benchmark problems and reported as the mean normalized score. Each training task is evaluated for 20 episodes. Shaded regions show one standard deviation across three seeds.
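
The input layout described in the Figure 1 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: it assumes the task parameter has already been embedded to the model width `d`, that each time step contributes three tokens (return-to-go, state, action, as in a standard decision transformer), and that the task tokens come before the learnable tokens. The function name `build_input_sequence` and all shapes are hypothetical, and in a real model $\mathbf{z}$ would be a trainable parameter rather than a fixed array.

```python
import numpy as np

def build_input_sequence(c_emb, z, traj_tokens, K):
    """Prepend a minimalist prompt to the most recent K time steps.

    c_emb:       (d,) embedded task parameter vector (assumed pre-embedded).
    z:           (3, d) learnable prompt tokens z_1, z_2, z_3.
    traj_tokens: (3 * T, d) trajectory tokens, three per time step.
    """
    c_tokens = np.tile(c_emb, (3, 1))               # [c, c, c], shape (3, d)
    prompt = np.concatenate([c_tokens, z], axis=0)  # minimalist prompt, (6, d)
    recent = traj_tokens[-3 * K:]                   # last K steps, (3K, d)
    return np.concatenate([prompt, recent], axis=0)

d, K, T = 8, 4, 10
rng = np.random.default_rng(0)
c_emb = rng.normal(size=d)
z = rng.normal(size=(3, d))             # stand-in for trainable parameters
traj_tokens = rng.normal(size=(3 * T, d))

seq = build_input_sequence(c_emb, z, traj_tokens, K)
print(seq.shape)  # (18, 8): 6 prompt tokens + 3 * K trajectory tokens
```

Under these assumptions the prompt adds a fixed six tokens regardless of task, whereas a demonstration prompt grows with the length of the demonstration segment.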