R+X: Retrieval and Execution from Everyday Human Videos

Georgios Papagiannis, Norman Di Palo, Pietro Vitiello, Edward Johns

TL;DR

R+X tackles learning robot skills from long, unlabelled first-person videos of humans performing everyday tasks by splitting the problem into retrieval and execution stages that rely on off-the-shelf foundation models. The retrieval stage uses a Vision-Language Model to extract task-relevant clips from a single wearable video, while the execution stage uses a few-shot in-context imitation learner (KAT) conditioned on the retrieved examples to generate robot hand trajectories without fine-tuning. Experiments across 12 household tasks show robust generalization to unseen objects, distractors, and varying layouts, with R+X outperforming monolithic language-conditioned policies. The work demonstrates a scalable, training-free pathway to robot skill acquisition from natural human data, enabling rapid deployment and continual learning.

Abstract

We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method (KAT) on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos and code are available at https://www.robot-learning.uk/r-plus-x.
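The two-stage flow described in the abstract can be sketched as plain function composition. This is a minimal illustration only, not the authors' code: `Clip`, `Example`, `retrieve`, `execute`, `r_plus_x`, and the callables `is_relevant`, `to_example`, and `learner` are all hypothetical names standing in for the VLM-based retrieval, the keypoint and hand-trajectory extraction, and KAT respectively.

```python
from dataclasses import dataclass
from typing import Callable, List

Points = List[List[float]]  # a list of 3D points, e.g. [[x, y, z], ...]


@dataclass
class Clip:
    """A short segment of the long human video (hypothetical container)."""
    start: int  # first frame index
    end: int    # one past the last frame index


@dataclass
class Example:
    """One in-context example: scene keypoints paired with a hand trajectory."""
    keypoints: Points   # visual 3D keypoints from the clip's first frame
    trajectory: Points  # extracted hand joint positions over time


def retrieve(command: str, clips: List[Clip],
             is_relevant: Callable[[str, Clip], bool]) -> List[Clip]:
    """Retrieval: keep clips that the stand-in judge marks as relevant.

    In R+X this role is played by a Vision Language Model; here it is
    an arbitrary callable supplied by the caller (an assumption).
    """
    return [clip for clip in clips if is_relevant(command, clip)]


def execute(context: List[Example], live_keypoints: Points,
            learner: Callable[[List[Example], Points], Points]) -> Points:
    """Execution: condition a stand-in in-context learner on the examples.

    No training step happens here; the examples are passed as context.
    """
    return learner(context, live_keypoints)


def r_plus_x(command: str, clips: List[Clip],
             is_relevant: Callable[[str, Clip], bool],
             to_example: Callable[[Clip], Example],
             live_keypoints: Points,
             learner: Callable[[List[Example], Points], Points]) -> Points:
    """Retrieve relevant clips, convert them to examples, then execute."""
    retrieved = retrieve(command, clips, is_relevant)
    context = [to_example(clip) for clip in retrieved]
    return execute(context, live_keypoints, learner)
```

Keeping the relevance judge and the learner as injected callables mirrors the separation the abstract describes: retrieval needs no manual annotation of the video, and execution needs no training on the retrieved clips, so either component could be swapped without touching the other.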

Paper Structure (19 sections, 13 figures, 1 table)

This paper contains 19 sections, 13 figures, 1 table.

Figures (13)

  • Figure 1: The main assumptions and constraints of many recent Learning from Observation methods.
  • Figure 2: Upon receiving a language command, R+X retrieves all the relevant clips from the human video. Each retrieved clip is transformed from pixels to a sparse 3D point representation of the hand joint movement and the salient parts of the visual observation.
  • Figure 3: The visual 3D keypoints of the first frame of each of the Z videos obtained from the retrieval phase, along with each extracted hand joint trajectory, are used as context for KAT. To execute a skill, visual 3D keypoints are extracted from the live observation and used as input to KAT, which generates a sequence of hand joints. By mapping this sequence to gripper poses, the robot executes the desired task.
  • Figure 4: We test R+X on 12 everyday tasks, executed by a human in different rooms and with different distractors.
  • Figure 5: Examples of spatial, language and distractor generalisation. Gripper trajectories move from red to blue.
  • ...and 8 more figures