Generalization to New Sequential Decision Making Tasks with In-Context Learning

Sharath Chandra Raparthy, Eric Hambro, Robert Kirk, Mikael Henaff, Roberta Raileanu

TL;DR

This paper investigates how transformers can generalize to entirely new sequential decision making tasks using in-context learning. By pretraining on large offline datasets of multi-trajectory sequences with trajectory burstiness, the authors show that a context of full trajectories from the same task enables few-shot learning of unseen MiniHack and Procgen tasks without weight updates. Key findings are that larger models, bigger and more diverse datasets, higher environment stochasticity, and increased trajectory burstiness all improve cross-task generalization. The approach offers a practical path to zero- or few-shot adaptation in complex, stochastic sequential decision problems, with implications for robotics and autonomous systems.

Abstract

Training autonomous agents that can learn new tasks from only a handful of demonstrations is a long-standing problem in machine learning. Recently, transformers have been shown to learn new language or vision tasks without any weight updates from only a few examples, an ability referred to as in-context learning. However, the sequential decision making setting poses additional challenges: it has a lower tolerance for errors, since the environment's stochasticity or the agent's actions can lead to unseen, and sometimes unrecoverable, states. In this paper, we use an illustrative example to show that naively applying transformers to sequential decision making problems does not enable in-context learning of new tasks. We then demonstrate how training on sequences of trajectories with certain distributional properties leads to in-context learning of new sequential decision making tasks. We investigate different design choices and find that larger model and dataset sizes, as well as more task diversity, environment stochasticity, and trajectory burstiness, all result in better in-context learning of new out-of-distribution tasks. By training on large diverse offline datasets, our model is able to learn new MiniHack and Procgen tasks without any weight updates from just a handful of demonstrations.

Paper Structure (35 sections, 17 figures, 5 tables)

Figures (17)

  • Figure 1: Illustration of Train and Test Tasks. (Left) A collection of procedurally generated Procgen levels from the Fruitbot task, demonstrating the complexity and diversity inherent in the environment's design. (Middle) Tasks used for training. (Right) Tasks used for testing. Note that the test tasks are entirely distinct from the training tasks, and each of them is procedurally generated, consisting of multiple levels.
  • Figure 2: Experimental Setup: We create a dataset of expert trajectories by rolling out expert policies on $N$ tasks. Given these expert trajectories, we construct multi-trajectory sequences with trajectory burstiness $p_b$. A sequence is bursty when there are at least two trajectories in the sequence from the same level. However, note that these trajectories are typically different due to the environment's stochasticity. These multi-trajectory sequences then serve as input to the causal transformer, which we train to predict actions. During evaluation, we condition the transformer on a few expert trajectories from an unseen task, then roll out the transformer policy until the episode terminates.
  • Figure 3: Performance on New MiniHack Tasks comparing (1) our multi-trajectory transformer conditioned on different numbers of demonstrations from the same level, (2) the Hashmap baseline conditioned on the same demonstrations, and (3) the BC baseline conditioned on zero or one demonstration due to context length constraints. Our model outperforms both baselines when provided with at least one demonstration, and its performance improves with the number of demonstrations.
  • Figure 4: Performance on New Procgen Tasks comparing (1) our multi-trajectory transformer conditioned on different numbers of demonstrations from the same level, (2) the Hashmap baseline conditioned on the same demonstrations, and (3) the BC baseline conditioned on zero or one demonstration due to context length constraints. Our model outperforms both behavioral cloning baselines, is competitive with Hashmap on Plunder, and its performance improves with the number of demonstrations.
  • Figure 5: Effect of Trajectory Burstiness: Mean performance (with std. across $3$ seeds) for different levels of trajectory burstiness (a), dataset sizes (b), model sizes (c) and numbers of training tasks (d). These factors all have a positive effect on in-context learning.
  • ...and 12 more figures
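
The sequence construction described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `sample_sequence`, `is_bursty`, and `trajs_by_level` are hypothetical, and the sampling scheme is an assumption: with probability $p_b$ two slots are filled from one shared level, otherwise every slot comes from a distinct level. The caption only states that a bursty sequence contains at least two trajectories from the same level.

```python
import random


def sample_sequence(trajs_by_level, n_traj, p_b, rng):
    """Sample one multi-trajectory sequence as a list of (level, trajectory).

    Assumed scheme (not taken from the paper): with probability p_b the
    sequence is bursty, meaning at least two of its trajectories come from
    the same level; otherwise all trajectories come from distinct levels.
    Requires 2 <= n_traj <= number of levels.
    """
    levels = sorted(trajs_by_level)
    if rng.random() < p_b:
        # Bursty: reserve two slots for one shared level, fill the rest freely.
        shared = rng.choice(levels)
        chosen = [shared, shared] + [rng.choice(levels) for _ in range(n_traj - 2)]
    else:
        # Non-bursty: sample levels without replacement.
        chosen = rng.sample(levels, n_traj)
    rng.shuffle(chosen)
    # Draw one stored expert trajectory per chosen level. In a stochastic
    # environment, two rollouts of the same level usually differ; here the
    # same stored trajectory may occasionally be drawn twice.
    return [(lvl, rng.choice(trajs_by_level[lvl])) for lvl in chosen]


def is_bursty(sequence):
    """True if at least two trajectories in the sequence share a level."""
    level_ids = [lvl for lvl, _ in sequence]
    return len(set(level_ids)) < len(level_ids)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for an expert dataset: 10 levels, 3 trajectories each.
    toy = {lvl: [f"traj-{lvl}-{k}" for k in range(3)] for lvl in range(10)}
    seqs = [sample_sequence(toy, n_traj=4, p_b=0.5, rng=rng) for _ in range(1000)]
    print(sum(is_bursty(s) for s in seqs) / len(seqs))
```

In this sketch the measured fraction of bursty sequences should be close to `p_b`, since the non-bursty branch never repeats a level and the bursty branch always does.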