Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics

Norman Di Palo; Edward Johns

TL;DR

This work demonstrates that off-the-shelf text-pretrained Transformers can perform few-shot imitation learning in robotics when visual observations and action trajectories are recast as token sequences, a framework called Keypoint Action Tokens (KAT). Visual observations are grounded into 3D keypoints via DINO-ViT descriptors, and end-effector poses are encoded as triplets of 3D points to form action tokens. Both are fed to a large language model that has received no robotics-specific training. The method performs on par with or better than state-of-the-art imitation learning on several everyday manipulation tasks with as few as 10 demonstrations, and larger language models show stronger in-context imitation capabilities, suggesting a promising direction for repurposing language models for embodied tasks. Limitations include poor scaling to many demonstrations and reliance on a fixed number of keypoints, which points to future work on dynamic keypoint extraction and possible finetuning for larger data regimes.
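To make the "triplets of 3D points" idea concrete, the sketch below encodes an end-effector pose as three points and recovers the pose from them. The point placement (the origin plus one point along each of the gripper's x and y axes), the `scale` offset, and the function names are illustrative assumptions, not the authors' exact construction.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _add_scaled(a, b, s):
    return [a[i] + s * b[i] for i in range(3)]

def _dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _unit(a):
    n = math.sqrt(_dot(a, a))
    return [c / n for c in a]

def pose_to_triplet(R, t, scale=0.1):
    """Encode a pose (3x3 rotation R as nested lists, translation t) as three 3D points.

    Assumed layout: p0 at the origin of the end-effector frame, p1 offset along
    its x axis, p2 offset along its y axis (offsets of `scale` metres).
    """
    x_axis = [R[i][0] for i in range(3)]  # first column of R
    y_axis = [R[i][1] for i in range(3)]  # second column of R
    return [list(t), _add_scaled(t, x_axis, scale), _add_scaled(t, y_axis, scale)]

def triplet_to_pose(points):
    """Recover (R, t) from three points, re-orthogonalising with Gram-Schmidt
    so that slightly inconsistent predicted points still give a valid rotation."""
    p0, p1, p2 = points
    x = _unit(_sub(p1, p0))
    y_raw = _sub(p2, p0)
    y = _unit(_add_scaled(y_raw, x, -_dot(y_raw, x)))  # remove the component along x
    z = _cross(x, y)                                   # right-handed third axis
    R = [[x[i], y[i], z[i]] for i in range(3)]         # axes become the columns of R
    return R, list(p0)
```

One appeal of a point-based encoding is that actions live in the same 3D coordinate space as the keypoint observations, so the model only ever sees lists of point coordinates.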

Abstract

We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences that emulate the demonstrator's behaviour. We achieve this by transforming visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer (GPT-4 Turbo) can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). We show that these Transformers, despite being trained only on language, excel at translating tokenised visual keypoint observations into action trajectories, performing on par with or better than state-of-the-art imitation learning (diffusion policies) in the low-data regime on a suite of real-world, everyday tasks. Rather than operating in the language domain as is typical, KAT leverages text-based Transformers to operate in the vision and action domains, learning general patterns in demonstration data for highly efficient imitation learning. This indicates promising new avenues for repurposing natural language models for embodied tasks. Videos are available at https://www.robot-learning.uk/keypoint-action-tokens.
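As a rough illustration of how observations and actions might be turned into text that a language model can ingest and generate, the sketch below serialises demonstrations into a few-shot prompt and parses a completion back into point triplets. The `Input:`/`Output:` layout, the rounding to integer millimetres, and all function names are assumptions made for this example; the paper's actual prompt format may differ, and no model call is shown.

```python
def tokenise(points, scale=1000):
    """Flatten 3D points (assumed to be in metres) into space-separated integers."""
    return " ".join(str(round(c * scale)) for p in points for c in p)

def build_prompt(demos, query_keypoints):
    """Build a few-shot prompt from (keypoints, action_points) demonstration pairs.

    The final line is left open so the model completes the action tokens
    for the new observation.
    """
    lines = []
    for keypoints, action_points in demos:
        lines.append("Input: " + tokenise(keypoints))
        lines.append("Output: " + tokenise(action_points))
    lines.append("Input: " + tokenise(query_keypoints))
    lines.append("Output:")
    return "\n".join(lines)

def parse_actions(text, scale=1000):
    """Parse a completion back into a list of poses, each a triplet of 3D points."""
    values = [int(v) / scale for v in text.split()]
    if len(values) % 9 != 0:
        raise ValueError("expected a multiple of 9 numbers (3 points x 3 coordinates)")
    points = [values[i:i + 3] for i in range(0, len(values), 3)]
    return [points[i:i + 3] for i in range(0, len(points), 3)]
```

In use, the string returned by `build_prompt` would be sent to the language model, and the returned text passed to `parse_actions` to obtain a trajectory of point triplets.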

Paper Structure (22 sections, 10 figures, 2 tables)

Introduction
Related Work
Method
Keypoint Tokens
Action Tokens
In-Context Imitation Learning via Pretrained Transformers
Experiments
Experimental Setup
Tasks
Baselines
Results on Few-Shot Imitation Learning
Vision: Investigations on Keypoint Tokens
Action: Investigations on Action Tokens
Free Robotics Lunch: Better Imitation Learning Machines by Scaling Language Models
Conclusion
...and 7 more sections

Figures (10)

Figure 1: An illustration of our pipeline. KAT transforms a visual observation into a sequence of keypoint tokens. From these, the text-pretrained Transformer (LLM) predicts action tokens, which are then executed as a trajectory of poses.
Figure 2: Illustration of the extraction pipeline of keypoint tokens.
Figure 3: Illustration of the action tokens, used in this work to represent $SE(3)$ end-effector poses.
Figure 4: The tasks on which we evaluated our method and the baselines.
Figure 5: Success rate of each method as a function of the number of demos. While KAT outperforms the baselines in the few-shot regime ($\le 20$ demos), in-context learning struggles to improve as the number of demos increases further. The plot shows the mean and standard deviation across tasks.
...and 5 more figures
