Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics

Norman Di Palo, Edward Johns

TL;DR

This work demonstrates that off-the-shelf text-pretrained Transformers can perform few-shot imitation learning in robotics when visual observations and action trajectories are recast as token sequences, a framework the authors call Keypoint Action Tokens (KAT). Visual observations are grounded into 3D keypoints via DINO-ViT descriptors, and end-effector poses are encoded as triplets of 3D points to form action tokens; both are fed to a large language model with no robotics-specific training. The method performs on par with or better than state-of-the-art imitation learning on several everyday manipulation tasks with as few as 10 demonstrations, and larger language models show stronger in-context imitation capabilities, suggesting a promising direction for repurposing language models for embodied tasks. Limitations include poor scaling as the number of demonstrations grows and reliance on a fixed number of keypoints, pointing to future work on dynamic keypoint extraction and possible finetuning for larger data regimes.

Abstract

We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences that emulate the demonstrator's behaviour. We achieve this by transforming visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer (GPT-4 Turbo) can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). We show that these Transformers, despite being trained only on language, excel at translating tokenised visual keypoint observations into action trajectories, performing on par with or better than state-of-the-art imitation learning (diffusion policies) in the low-data regime on a suite of real-world, everyday tasks. Rather than operating in the language domain as is typical, KAT leverages text-based Transformers to operate in the vision and action domains to learn general patterns in demonstration data for highly efficient imitation learning, indicating promising new avenues for repurposing natural language models for embodied tasks. Videos are available at https://www.robot-learning.uk/keypoint-action-tokens.
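
To make the in-context setup more concrete, the following is a minimal sketch of how demonstrations might be serialised into a text prompt for a language model. It is not the authors' implementation: the function names `format_points` and `build_prompt`, the `Input:`/`Output:` labels, and the choice to render coordinates as integers at a hypothetical scale of 100 (centimetres, if the inputs are in metres) are all assumptions made for illustration. No model call is shown; the sketch only builds the string a model would be asked to continue.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]            # one 3D point (x, y, z), assumed to be in metres
Demo = Tuple[List[Point], List[Point]]  # (keypoint points, action points)


def format_points(points: List[Point], scale: int = 100) -> str:
    """Render 3D points as integer triples (assumed format, not from the paper)."""
    return " ".join(
        "[" + ",".join(str(round(c * scale)) for c in p) + "]" for p in points
    )


def build_prompt(demos: List[Demo], query_keypoints: List[Point]) -> str:
    """Concatenate demonstrations as input/output pairs, then append the query.

    The model is expected to continue the text after the final "Output:"
    with action points in the same format.
    """
    lines = []
    for keypoints, actions in demos:
        lines.append("Input: " + format_points(keypoints))
        lines.append("Output: " + format_points(actions))
    lines.append("Input: " + format_points(query_keypoints))
    lines.append("Output:")
    return "\n".join(lines)


if __name__ == "__main__":
    # One toy demonstration: a single keypoint and a single action point.
    demos = [([(0.12, -0.05, 0.3)], [(0.1, 0.0, 0.25)])]
    print(build_prompt(demos, [(0.2, 0.1, 0.3)]))
```

Under these assumptions, each demonstration contributes one input line and one output line, so the prompt length grows linearly with the number of demonstrations.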

Paper Structure (22 sections, 10 figures, 2 tables)

This paper contains 22 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: An illustration of our pipeline. KAT transforms a visual observation into a sequence of keypoint tokens. From it, the text-pretrained Transformer (LLM) predicts action tokens, which are then executed as a trajectory of poses.
  • Figure 2: Illustration of the extraction pipeline of keypoint tokens.
  • Figure 3: Illustration of the action tokens, used in this work to represent $SE(3)$ end-effector poses.
  • Figure 4: The tasks on which we evaluated our method and the baselines.
  • Figure 5: Success rate of each method as a function of the number of demos. While KAT outperforms the baselines in the few-shot regime ($\le 20$ demos), in-context learning struggles to improve as the number of demos increases further. Plot shows mean and standard deviation across tasks.
  • ...and 5 more figures
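
The idea behind Figure 3, representing an $SE(3)$ end-effector pose as a triplet of 3D points, can be sketched as follows. This is one possible construction under stated assumptions, not necessarily the layout used in the paper: here the three points are taken to be the end-effector origin plus two points offset along its local x and y axes, and the `offset` distance of 0.1 is a hypothetical value. Decoding recovers the rotation with a Gram-Schmidt step, so it tolerates small errors in the predicted points.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def _sub(a: Vec, b: Vec) -> List[float]:
    return [a[i] - b[i] for i in range(3)]


def _dot(a: Vec, b: Vec) -> float:
    return sum(a[i] * b[i] for i in range(3))


def _normalize(a: Vec) -> List[float]:
    n = math.sqrt(_dot(a, a))
    return [c / n for c in a]


def _cross(a: Vec, b: Vec) -> List[float]:
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def pose_to_triplet(position: Vec, rotation: Sequence[Vec],
                    offset: float = 0.1) -> List[List[float]]:
    """Encode a pose as three points (assumed layout: origin, +x point, +y point).

    `rotation` is a 3x3 matrix given as rows; its columns are the local axes.
    """
    x_axis = [rotation[i][0] for i in range(3)]
    y_axis = [rotation[i][1] for i in range(3)]
    p0 = list(position)
    p1 = [position[i] + offset * x_axis[i] for i in range(3)]
    p2 = [position[i] + offset * y_axis[i] for i in range(3)]
    return [p0, p1, p2]


def triplet_to_pose(points: Sequence[Vec]) -> Tuple[List[float], List[List[float]]]:
    """Decode three points back into (position, 3x3 rotation as rows)."""
    p0, p1, p2 = points
    x = _normalize(_sub(p1, p0))
    v = _sub(p2, p0)
    # Gram-Schmidt: remove the component of v along x before normalising.
    proj = _dot(v, x)
    y = _normalize([v[i] - proj * x[i] for i in range(3)])
    z = _cross(x, y)  # right-handed third axis
    rotation = [[x[i], y[i], z[i]] for i in range(3)]
    return list(p0), rotation


if __name__ == "__main__":
    # A 90-degree rotation about the z axis.
    R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    triplet = pose_to_triplet([0.3, -0.1, 0.5], R)
    print(triplet)
    print(triplet_to_pose(triplet))
```

A point-based encoding of this kind puts actions in the same coordinate space and the same textual form as the 3D keypoints, which is one plausible reason to prefer it over, say, quaternions when the sequence is to be consumed by a text model.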