In-Context Imitation Learning via Next-Token Prediction

Letian Fu; Huang Huang; Gaurav Datta; Lawrence Yunliang Chen; William Chung-Ho Panitch; Fangchen Liu; Hui Li; Ken Goldberg

TL;DR

The paper introduces the In-Context Robot Transformer (ICRT), a transformer-based policy that performs in-context imitation learning on a real robot by conditioning on sensorimotor trajectories as prompts, without updating its parameters at test time. It combines a vision encoder, modality-specific projectors, and a causal transformer to predict the next actions conditioned on the prompt, enabling zero-shot generalization to unseen tasks and objects in new environments. The authors present the ICRT-MT multi-task dataset and demonstrate that ICRT, particularly when pre-trained on MT data, outperforms state-of-the-art goal- and language-conditioned baselines in real-robot experiments. Their ablations indicate that the prompt loss and the choice of model initialization significantly affect performance. The work highlights the practicality of training-free, prompt-based adaptation for multi-task robotics and discusses limitations related to primitive generalization, fixed morphologies, and inference speed, pointing toward scalable future directions.

Abstract

We explore how to enhance next-token prediction models to perform in-context imitation learning on a real robot, where the robot executes new tasks by interpreting contextual information provided during the input phase, without updating its underlying policy parameters. We propose In-Context Robot Transformer (ICRT), a causal transformer that performs autoregressive prediction on sensorimotor trajectories without relying on any linguistic data or reward function. This formulation enables flexible and training-free execution of new tasks at test time, achieved by prompting the model with sensorimotor trajectories of the new task, composed of image observation, action, and state tuples collected through human teleoperation. Experiments with a Franka Emika robot demonstrate that ICRT can adapt to new tasks specified by prompts, even in environment configurations that differ from both the prompt and the training data. In a multi-task environment setup, ICRT significantly outperforms current state-of-the-art next-token prediction models in robotics at generalizing to unseen tasks. Code, checkpoints, and data are available at https://icrt.dev/

Paper Structure (27 sections, 4 figures, 11 tables)

Introduction
Related Works
Multi-Task Imitation Learning for Robotics
In-Context Learning
Problem Statement
Approach
Data Formulation
Model Architecture
Experiments
Ablations
Ablations
Model Initialization
Training Dataset
No Prompt Loss
Limitations and Conclusion
...and 12 more sections

Figures (4)

Figure 2: Our physical setup with the Franka Emika robot, the wrist and side cameras, and the objects used in training and evaluation. We consider 6 primitives for training and choose "pick up and place" and "poke" as the primitives for evaluation (dark green).
Figure 3: Method Overview: (Left) We encode camera observations with a pre-trained vision transformer and encode proprioception with an MLP. We concatenate the visual latent and the proprioception latent and use attention pooling to extract a feature $f_s$ as the current state representation. We encode the current action with an MLP to get $f_a$. (Right) We concatenate multiple trajectories of the same task and randomly sample the first $k$ trajectories as the prompt. A causal transformer autoregressively predicts the next series of tokens. We decode the tokens at the positions of the state features with an MLP to generate the next $h=16$ actions.
Figure 4: Example inference pipeline of ICRT on the task of picking up the radish and putting it in the gray bowl. A human-teleoperated demonstration trajectory consisting of image observations, proprioception, and actions is provided as the prompt. ICRT takes the prompt and the current observation in a different environment to accomplish the task.
Figure 5: Illustrations of the prompt trajectories (top) and test scenes (bottom) for the "pick up the black dog and place in the pink bowl" task. Three prompt trajectories of different types are collected. The test scenes differ from all prompt trajectories, and 5 tiers of scenes with different numbers of distractors are considered.
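
The sequence layout described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`interleave`, `build_sequence`, `causal_mask`, `state_positions`), the feature dimension, and the random features standing in for $f_s$ and $f_a$ are all assumptions made for illustration, and the strict state/action alternation is an assumed layout that the caption does not spell out. The sketch only shows how several trajectories of one task might be concatenated into one token sequence, where the prompt ends, and at which positions actions would be decoded.

```python
import numpy as np

def interleave(state_feats, action_feats):
    """Interleave per-step features as s_1, a_1, s_2, a_2, ... (assumed layout)."""
    T, d = state_feats.shape
    assert action_feats.shape == (T, d)
    seq = np.empty((2 * T, d), dtype=state_feats.dtype)
    seq[0::2] = state_feats   # even positions hold state features (stand-in for f_s)
    seq[1::2] = action_feats  # odd positions hold action features (stand-in for f_a)
    return seq

def build_sequence(trajectories):
    """Concatenate several (state_feats, action_feats) trajectories of one task."""
    return np.concatenate([interleave(s, a) for s, a in trajectories], axis=0)

def causal_mask(n):
    """Boolean mask: entry (i, j) is True if position i may attend to position j."""
    return np.tril(np.ones((n, n), dtype=bool))

def state_positions(n_tokens):
    """Indices of state tokens, where an action head would decode the next actions."""
    return np.arange(0, n_tokens, 2)

# Hypothetical data: three trajectories of lengths 3, 4, 5 with feature size 8.
rng = np.random.default_rng(0)
d = 8
trajs = [(rng.normal(size=(T, d)), rng.normal(size=(T, d))) for T in (3, 4, 5)]

k = 2  # assumption: the first k trajectories serve as the prompt
seq = build_sequence(trajs)
prompt_len = sum(2 * s.shape[0] for s, _ in trajs[:k])  # tokens belonging to the prompt
mask = causal_mask(seq.shape[0])
decode_at = state_positions(seq.shape[0])
```

Because every trajectory contributes an even number of tokens, state features stay at even positions after concatenation, so a single stride-2 index is enough to locate them in this simplified layout.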
