Affordances from Human Videos as a Versatile Representation for Robotics

Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, Deepak Pathak

TL;DR

VRB introduces a robot-centric affordance representation, consisting of contact points (c) and post-contact trajectories (tau), learned from egocentric human videos to support imitation, exploration, goal-driven tasks, and discrete RL in real-world robotics. By extracting c and tau via hand detection, skin segmentation, GMMs, and egomotion compensation, and by using a ResNet/Transformer architecture with multi-modal outputs, VRB provides transferable priors that accelerate learning across tasks and platforms. The approach is validated in the wild across 4 real-world environments, over 10 tasks, and 2 robotic platforms, showing improved data quality for imitation, boosted exploration efficiency, faster goal-conditioned learning, and effective action-space discretization, with representations that transfer to control better than strong baselines. VRB thus offers a practical pathway to leverage vast human-video data for robust, real-world robotic manipulation.
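
The multi-modal contact-point output mentioned above can be illustrated with a small sketch. It assumes the predicted distribution is a mixture of 2D Gaussians in pixel space with diagonal covariance; the names `Gaussian2D`, `sample_contact_point`, and `clamp_to_image` are hypothetical and are not taken from the paper or its code release.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gaussian2D:
    """One mixture component in pixel space (diagonal covariance is an assumption)."""
    weight: float
    mean: Tuple[float, float]
    std: Tuple[float, float]


def sample_contact_point(mixture: List[Gaussian2D], rng: random.Random) -> Tuple[float, float]:
    """Pick a component in proportion to its weight, then sample a pixel from it."""
    weights = [g.weight for g in mixture]
    comp = rng.choices(mixture, weights=weights, k=1)[0]
    u = rng.gauss(comp.mean[0], comp.std[0])
    v = rng.gauss(comp.mean[1], comp.std[1])
    return (u, v)


def clamp_to_image(point: Tuple[float, float], width: int, height: int) -> Tuple[float, float]:
    """Keep a sampled pixel inside the image bounds."""
    u, v = point
    return (min(max(u, 0.0), width - 1.0), min(max(v, 0.0), height - 1.0))


if __name__ == "__main__":
    # Illustrative values only: two candidate contact regions in a 640x480 image.
    mixture = [
        Gaussian2D(weight=0.7, mean=(200.0, 150.0), std=(10.0, 8.0)),
        Gaussian2D(weight=0.3, mean=(450.0, 300.0), std=(12.0, 12.0)),
    ]
    rng = random.Random(0)
    print(clamp_to_image(sample_contact_point(mixture, rng), 640, 480))
```

Sampling from a mixture rather than taking a single mean keeps several plausible contact regions available, which is the property a multi-modal output is meant to preserve.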

Abstract

Building a robot that can understand and learn to interact by watching humans has inspired several vision problems. However, despite some successful results on static datasets, it remains unclear how current models can be used on a robot directly. In this paper, we aim to bridge this gap by leveraging videos of human interactions in an environment-centric manner. Utilizing internet videos of human behavior, we train a visual affordance model that estimates where and how in the scene a human is likely to interact. The structure of these behavioral affordances directly enables the robot to perform many complex tasks. We show how to seamlessly integrate our affordance model with four robot learning paradigms including offline imitation learning, exploration, goal-conditioned learning, and action parameterization for reinforcement learning. We show the efficacy of our approach, which we call VRB, across 4 real-world environments, over 10 different tasks, and 2 robotic platforms operating in the wild. Results, visualizations and videos at https://robo-affordances.github.io/

Paper Structure (28 sections, 3 equations, 13 figures, 5 tables, 3 algorithms)

Figures (13)

  • Figure 1: We leverage human videos to learn visual affordances that can be deployed on multiple real robots, in the wild, spanning several tasks and learning paradigms. Videos available at https://robo-affordances.github.io/.
  • Figure 2: VRB Overview. First, we learn an actionable representation of visual affordances from human videos: the model predicts contact points and trajectory waypoints with supervision from future frames. For robot deployment, we query the affordance model and convert its outputs to 3D actions to execute.
  • Figure 3: Robot Learning Paradigms: (a) Offline Data Collection -- Used to investigate the quality of the collected data. (b) Exploration -- The robot needs to use intrinsic rewards to improve. (c) Goal-Conditioned Learning -- A desired task is specified via a goal image, used to provide reward. (d) Action Spaces -- Reduced action spaces are easier to search and allow for discrete control.
  • Figure 4: Qualitative affordance model outputs for VRB, HOI, Hotspots, and HAP, showing the predicted contact point region and post-grasp trajectory (green arrow for VRB, red for HOI). We can see that VRB produces the most meaningful affordances.
  • Figure 5: Exploration: Coincidental success of VRB in comparison to random exploration or exploration based on HAP.
  • ...and 8 more figures
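
The deployment step described in the Figure 2 caption (converting model outputs to 3D actions) could look roughly like the sketch below. It assumes a calibrated pinhole camera and a single depth reading at the contact pixel that is reused for every waypoint. Both assumptions, and the helper names `deproject` and `affordance_to_3d`, are illustrative and are not the authors' implementation.

```python
from typing import List, Tuple

Pixel = Tuple[float, float]
Point3D = Tuple[float, float, float]


def deproject(u: float, v: float, depth_m: float,
              fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Back-project pixel (u, v) at a given depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def affordance_to_3d(contact_px: Pixel, waypoints_px: List[Pixel], depth_m: float,
                     intrinsics: Tuple[float, float, float, float]) -> List[Point3D]:
    """Lift a contact pixel and its post-contact waypoints to camera-frame 3D points.

    Assumption: every waypoint reuses the depth measured at the contact pixel,
    so the resulting motion stays in a plane parallel to the image.
    """
    fx, fy, cx, cy = intrinsics
    pixels = [contact_px] + list(waypoints_px)
    return [deproject(u, v, depth_m, fx, fy, cx, cy) for (u, v) in pixels]


if __name__ == "__main__":
    # Illustrative intrinsics (fx, fy, cx, cy) for a 640x480 image; not from the paper.
    intrinsics = (600.0, 600.0, 320.0, 240.0)
    path = affordance_to_3d((320.0, 240.0), [(350.0, 240.0), (380.0, 230.0)], 0.8, intrinsics)
    for point in path:
        print(point)
```

The returned points are expressed in the camera frame; a real robot would additionally need a camera-to-base transform before executing them.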