Graph Inverse Reinforcement Learning from Diverse Videos

Sateesh Kumar, Jonathan Zamora, Nicklas Hansen, Rishabh Jangir, Xiaolong Wang

TL;DR

GraphIRL learns reward functions from diverse third-person videos by transforming scenes into object-centric graphs and enforcing temporal progression through cycle-consistent graph embeddings. The approach uses an Interaction Network to model object interactions and derives a domain- and embodiment-invariant reward for standard RL training, enabling transfer from simulation to real robots without hand-crafted rewards. Experiments on cross-embodiment X-MAGICAL tasks and real-robot manipulation demonstrate robustness to visual domain shifts, and on a real-robot pushing task the learned reward outperforms a manually designed one. Ablations validate the value of modeling spatial interactions and the method's data efficiency. Overall, GraphIRL advances scalable IRL from diverse video data, with practical implications for robot learning from human demonstrations.

Abstract

Research on Inverse Reinforcement Learning (IRL) from third-person videos has shown encouraging results on removing the need for manual reward design for robotic tasks. However, most prior works are still limited by training from a relatively restricted domain of videos. In this paper, we argue that the true potential of third-person IRL lies in increasing the diversity of videos for better scaling. To learn a reward function from diverse videos, we propose to perform graph abstraction on the videos followed by temporal matching in the graph space to measure the task progress. Our insight is that a task can be described by entity interactions that form a graph, and this graph abstraction can help remove irrelevant information such as textures, resulting in more robust reward functions. We evaluate our approach, GraphIRL, on cross-embodiment learning in X-MAGICAL and learning from human demonstrations for real-robot manipulation. We show significant improvements in robustness to diverse video demonstrations over previous approaches, and even achieve better results than manual reward design on a real robot pushing task. Videos are available at https://sateeshkumar21.github.io/GraphIRL .
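
The graph abstraction described above can be illustrated with a minimal sketch. It assumes each frame has already been reduced to object bounding boxes, and it represents the scene as a fully connected directed graph whose nodes are box centers and whose edge features are center-to-center distances. The function name `boxes_to_graph`, the choice of node and edge features, and the example boxes are all hypothetical; the paper's actual graph construction may differ.

```python
import numpy as np

def boxes_to_graph(boxes):
    """Turn one frame's bounding boxes into a fully connected object graph.

    boxes: array-like of shape (num_objects, 4), rows are
           (x_min, y_min, x_max, y_max). Illustrative input format.
    Returns node features (box centers) plus, for every directed edge,
    the sender index, receiver index, and distance between the centers.
    """
    boxes = np.asarray(boxes, dtype=float)
    centers = np.stack(
        [(boxes[:, 0] + boxes[:, 2]) / 2.0,
         (boxes[:, 1] + boxes[:, 3]) / 2.0],
        axis=1,
    )
    n = len(centers)
    # All ordered pairs (i, j) with i != j, in row-major order.
    senders, receivers = np.nonzero(~np.eye(n, dtype=bool))
    dists = np.linalg.norm(centers[senders] - centers[receivers], axis=1)
    return centers, senders, receivers, dists

# Hypothetical frame with three detected objects.
frame_boxes = [[0, 0, 2, 2], [3, 4, 5, 6], [0, 4, 2, 6]]
nodes, senders, receivers, dists = boxes_to_graph(frame_boxes)
```

Because the graph keeps only positions and pairwise relations, appearance cues such as texture and color never enter the representation, which is the intuition behind the robustness claim in the abstract.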

Paper Structure (21 sections, 4 equations, 14 figures, 4 tables)

This paper contains 21 sections, 4 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: GraphIRL. We propose an approach for performing inverse reinforcement learning from diverse third-person videos via graph abstraction. Based on our learned reward functions, we successfully train image-based policies in simulation and deploy them on a real robot.
  • Figure 2: Overview. We extract object bounding boxes from video sequences using an off-the-shelf detector, and construct a graph abstraction of the scene. We model graph-abstracted object interactions using Interaction Networks (Battaglia et al., 2016), and learn a reward function by aligning video embeddings temporally. We then train image-based RL policies using our learned reward function, and deploy on a real robot.
  • Figure 3: Overview of X-MAGICAL task variants. We consider two environment variants and four embodiments for our simulated sweeping task experiments. Our work assesses the performance of IRL algorithms in both the Diverse and Standard environments across all four embodiments in the Same-Embodiment and Cross-Embodiment settings.
  • Figure 4: Cross-Embodiment Cross-Environment. Success rates of our method GraphIRL and baselines on (top) Standard Environment Pretraining $\rightarrow$ Diverse Environment RL and (bottom) Diverse Environment Pretraining $\rightarrow$ Standard Environment RL. All reported numbers are averaged over 5 seeds. Our approach performs favorably compared to the baselines in both settings.
  • Figure 5: Robotic Manipulation. Success rates of our method GraphIRL and baselines on the tasks of Reach, Push and Peg in Box. All results are averaged over 5 seeds. We observe significant performance gains, especially over vision-based baselines, due to the large domain gap.
  • ...and 9 more figures
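
The temporal alignment of video embeddings mentioned in the Figure 2 caption can be sketched with a generic cycle-consistency objective. The sketch assumes each video is already a sequence of per-frame embedding vectors. A frame from video `u` is softly matched to video `v`, cycled back to `u`, and penalized by how far the returned index lands from where it started. The function names, the temperature value, and the squared-error form are assumptions for illustration and are not taken from the paper's equations.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

def cycle_back_index(u, v, i, temperature=0.1):
    """Expected frame index in `u` after cycling frame i of `u` through `v`."""
    # Soft nearest neighbor of u[i] among the frames of v.
    alpha = softmax(-np.sum((v - u[i]) ** 2, axis=1) / temperature)
    v_soft = alpha @ v
    # Cycle back: distribution over the frames of u given the soft match.
    beta = softmax(-np.sum((u - v_soft) ** 2, axis=1) / temperature)
    return float(beta @ np.arange(len(u)))

def cycle_consistency_loss(u, v, temperature=0.1):
    """Mean squared gap between each frame index and its cycled-back index."""
    errors = [(cycle_back_index(u, v, i, temperature) - i) ** 2
              for i in range(len(u))]
    return float(np.mean(errors))

# Toy embeddings: video u progresses along a line; v_good visits the same
# points, while v_collapsed maps every frame to one point (no progress signal).
u = np.arange(5, dtype=float)[:, None] * np.array([1.0, 0.0])
v_good = u.copy()
v_collapsed = np.zeros_like(u)

loss_good = cycle_consistency_loss(u, v_good)
loss_collapsed = cycle_consistency_loss(u, v_collapsed)
```

In this toy setup the loss is near zero when the two videos share distinguishable progress states and large when one video's embeddings collapse, which is why an objective of this kind pushes embeddings to encode task progress.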