R+X: Retrieval and Execution from Everyday Human Videos

Georgios Papagiannis; Norman Di Palo; Pietro Vitiello; Edward Johns

TL;DR

R+X tackles learning robot skills from long, unlabelled first-person videos of humans performing everyday tasks by splitting the problem into retrieval and execution stages that rely on off-the-shelf foundation models. The retrieval stage uses a Vision Language Model (VLM) to extract task-relevant clips from the long, unlabelled video, while the execution stage conditions a few-shot in-context imitation learner (KAT) on the retrieved examples to generate hand-joint trajectories, which are mapped to gripper poses, without any fine-tuning. Experiments across 12 household tasks show robust generalisation to unseen objects, distractors, and varying layouts, with R+X outperforming monolithic language-conditioned policies. The work demonstrates a scalable, training-free pathway to robot skill acquisition from natural human data, enabling rapid deployment and continual learning.

Abstract

We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method (KAT) on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos and code are available at https://www.robot-learning.uk/r-plus-x.
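
The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the long video has already been split into candidate clips in sparse 3D form, and the names `Clip`, `retrieve`, `execute`, `is_relevant` and `learner` are hypothetical. `is_relevant` stands in for the VLM and `learner` stands in for the in-context imitation learner (KAT); both are supplied by the caller. Mapping the returned hand-joint trajectory to gripper poses is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[List[Point3D]]  # one list of hand-joint positions per timestep


@dataclass
class Clip:
    """Hypothetical container: one candidate clip in sparse 3D form."""
    keypoints: List[Point3D]     # visual 3D keypoints of the clip's first frame
    hand_trajectory: Trajectory  # hand-joint trajectory extracted from the clip


# Stand-in for the VLM: decides whether a clip shows the commanded behaviour.
RelevanceFn = Callable[[str, Clip], bool]
# Stand-in for the in-context learner: maps (context, live keypoints) to a trajectory.
Context = List[Tuple[List[Point3D], Trajectory]]
LearnerFn = Callable[[Context, List[Point3D]], Trajectory]


def retrieve(command: str, clips: List[Clip], is_relevant: RelevanceFn) -> List[Clip]:
    """Retrieval stage: keep only the clips judged relevant to the command."""
    return [clip for clip in clips if is_relevant(command, clip)]


def execute(
    command: str,
    clips: List[Clip],
    live_keypoints: List[Point3D],
    is_relevant: RelevanceFn,
    learner: LearnerFn,
) -> Trajectory:
    """Execution stage: condition the learner on retrieved examples, with no training step."""
    retrieved = retrieve(command, clips, is_relevant)
    if not retrieved:
        raise LookupError(f"no clips retrieved for command: {command!r}")
    context = [(clip.keypoints, clip.hand_trajectory) for clip in retrieved]
    return learner(context, live_keypoints)
```

Passing the two models in as plain callables keeps the sketch independent of any particular VLM or learner API, so either can be swapped without touching the pipeline logic.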

Paper Structure (19 sections, 13 figures, 1 table)

This paper contains 19 sections, 13 figures, 1 table.

Introduction
Related Work
R+X: Retrieval and Execution
Retrieval: Extracting visual examples from a long, unlabelled video
Preprocessing Videos into a Sparse 3D Representation
Execution: Few-Shot, In-Context Imitation from Video Examples
Experiments
Results
Conclusion
Appendix
Additional Related Work
Processing the Long, Unlabelled Human Video
Stabilisation of First Person Videos
Extracting Gripper Actions from Human Hands
Tasks Details and Success Criteria
...and 4 more sections

Figures (13)

Figure 1: The main assumptions and constraints of many recent Learning from Observation methods.
Figure 2: Upon receiving a language command, R+X retrieves all the relevant clips from the human video. Each retrieved clip is transformed from pixels to a sparse 3D point representation of the hand-joint movement and of salient parts of the visual observation.
Figure 3: The visual 3D keypoints of the first frame of each of the Z videos obtained from the retrieval phase, along with each extracted hand-joint trajectory, are used as context for KAT. To execute a skill, visual 3D keypoints are extracted from the live observation and used as input to KAT, which generates a sequence of hand joints. By mapping this sequence to gripper poses, the robot executes the desired task.
Figure 4: We test R+X on 12 everyday tasks, executed by a human in different rooms and with different distractors.
Figure 5: Examples of spatial, language and distractor generalisation. Gripper trajectories move from red to blue.
...and 8 more figures
