Vision-based Manipulation from Single Human Video with Open-World Object Graphs

Yifeng Zhu; Arisrei Lim; Peter Stone; Yuke Zhu

TL;DR

This paper tackles vision-based robot manipulation from a single human video in open-world settings. It introduces ORION, an object-centric framework that builds Open-world Object Graphs (OOGs) from the demonstration and generates a manipulation plan to guide a robot, generalizing across visual backgrounds, camera viewpoints, spatial layouts, and unseen object instances. The approach combines plan generation from video with SE(3) trajectory optimization and impedance-controlled execution, and it works with both RGB-D and RGB-only demonstrations. Key contributions include a formal problem formulation for open-world imitation from observation, the OOG representation, and a policy constructed from one video that scales to long-horizon tasks. ORION reaches an average success rate of 74.4% across tasks and demonstration types, and ablations and RGB-only variants highlight the effectiveness of object-centric reasoning and TAP-based keypoint tracking.

Abstract

This work presents an object-centric approach to learning vision-based manipulation skills from human videos. We investigate the problem of robot manipulation via imitation in the open-world setting, where a robot learns to manipulate novel objects from a single video demonstration. We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB or RGB-D video and deriving a policy that conditions on the extracted plan. Our method enables the robot to learn from videos captured by daily mobile devices and to generalize the policies to deployment environments with varying visual backgrounds, camera angles, spatial layouts, and novel object instances. We systematically evaluate our method on both short-horizon and long-horizon tasks, using RGB-D and RGB-only demonstration videos. Across varied tasks and demonstration types (RGB-D / RGB), we observe an average success rate of 74.4%, demonstrating the efficacy of ORION in learning from a single human video in the open world. Additional materials can be found on our project website: https://ut-austin-rpl.github.io/ORION-release.

Paper Structure (18 sections, 1 equation, 7 figures, 1 table, 4 algorithms)

Introduction
Problem Formulation
Method
Open-world Object Graph
Manipulation Plan Generation From $V$
Robot Policy To Synthesize Actions
Experiments
Experiment Setup
Experimental Results
Related Work
Conclusions
Additional Technical Details
Data Structure of an OOG.
Implementation Details
System Setup
...and 3 more sections

Figures (7)

Figure 1: Overview. We introduce ORION for tackling the problem of learning manipulation behaviors from single human video demonstrations. ORION first extracts a sequence of Open-World Object Graphs (OOGs), where each OOG models a keyframe state with task-relevant objects and hand information. Then, ORION leverages the OOG sequence to construct a manipulation policy that generalizes across varied initial conditions, specifically in four aspects: visual background, camera shifts, spatial layouts, and novel instances from the same object categories.
Figure 2: Overview of plan generation in ORION. ORION generates a manipulation plan from a given video $V$ in order for subsequent policies to synthesize actions. ORION first tracks objects and keypoints across the video frames. Then, keyframes are identified based on the velocity statistics of the keypoint trajectories. Finally, ORION generates an Open-world Object Graph (OOG) for every keyframe, resulting in a sequence of OOGs that serves as the spatiotemporal abstraction of the video. The figure is best viewed in color.
Figure 3: Overview of the ORION Policy. ORION first localizes task-relevant objects at test time and retrieves the matched OOG from the generated manipulation plan. Then, ORION uses the retrieved OOGs to predict the object motions by warping the object-centric feature trajectory from the video to match the test-time observation. The predicted trajectories are then used to optimize the SE(3) action sequence of the robot end effector, which is subsequently used to command the robot.
Figure 4: The upper part of the figure illustrates the following items: the initial and final frames of human videos for every task, the list of word descriptions provided along with the video, and the example images of initial states for policy evaluation. The lower part of the figure shows the overall evaluation of ORION over all seven tasks, including the success rates and the quantification of failed trials, separated by failure mode.
Figure 5: This figure shows results in the same format as Figure 4, with the difference that demonstration videos do not contain depth information and that videos come from different sources. Note that the Mug-on-coaster task uses a different demonstration video to showcase compatibility with different object instances, camera angles, and video sources.
...and 2 more figures
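
The Figure 2 caption states that keyframes are identified from the velocity statistics of the keypoint trajectories. As a rough illustration of that idea only, the following Python sketch marks a frame as a keyframe whenever the mean keypoint speed crosses a threshold, that is, when motion starts or stops. The function names `mean_keypoint_speed` and `detect_keyframes`, the `(T, K, D)` array layout, and the relative threshold `rel_threshold` are assumptions made for this example; the text above does not specify which statistic or threshold ORION actually uses.

```python
import numpy as np


def mean_keypoint_speed(tracks: np.ndarray) -> np.ndarray:
    """Mean per-step displacement over all keypoints.

    tracks: array of shape (T, K, D) holding K keypoint positions
            (D = 2 or 3) for each of T frames (assumed layout).
    Returns an array of shape (T - 1,), where entry t is the mean
    distance moved between frame t and frame t + 1.
    """
    step = np.diff(tracks, axis=0)                    # (T-1, K, D)
    return np.linalg.norm(step, axis=-1).mean(axis=1)


def detect_keyframes(tracks: np.ndarray, rel_threshold: float = 0.2) -> list:
    """Return frame indices where the motion state changes.

    A step counts as "moving" if its mean speed exceeds
    rel_threshold * (maximum mean speed). This rule is a hypothetical
    stand-in for the velocity statistics mentioned in the caption.
    The first and last frames are always included.
    """
    speed = mean_keypoint_speed(tracks)
    if speed.size == 0:                               # single-frame video
        return [0]

    moving = speed > rel_threshold * speed.max()

    keyframes = [0]
    for t in range(1, len(moving)):
        # Step t-1 ends at frame t and step t starts there, so a change
        # of state between the two steps is attributed to frame t.
        if moving[t] != moving[t - 1]:
            keyframes.append(t)

    last = tracks.shape[0] - 1
    if keyframes[-1] != last:
        keyframes.append(last)
    return keyframes
```

For a 10-frame track that is still for frames 0 to 2, moves at constant speed until frame 6, and is still afterwards, this sketch returns `[0, 2, 6, 9]`. A relative threshold is used here so the example does not depend on pixel or metric units; an absolute threshold would be an equally plausible choice.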
