VideoDex: Learning Dexterity from Internet Videos

Kenneth Shaw, Shikhar Bahl, Deepak Pathak

TL;DR

VideoDex tackles the data bottleneck in dexterous manipulation by using large-scale internet videos of humans to derive three priors: a visual prior, an action prior, and a physical prior, the last provided by Neural Dynamic Policies (NDPs). It retargets human hand motions to a robot embodiment and trains an open-loop policy that combines a visual encoder (R3M), action priors from the retargeted human trajectories, and NDPs to produce smooth trajectories. Across seven real-world tasks with a high-DoF hand-arm system, VideoDex outperforms state-of-the-art baselines, and ablations show the importance of action priors, a two-stream architecture, and robust initial pose estimation. The work demonstrates that broad human-video data can bootstrap dexterous robotics, reducing the amount of in-domain data required for strong performance and enabling generalization to unseen objects and even different grippers.

Abstract

To build general robotic agents that can operate in many environments, it is often imperative for the robot to collect experience in the real world. However, this is often not feasible due to safety, time, and hardware restrictions. We thus propose leveraging the next best thing to real-world experience: internet videos of humans using their hands. Visual priors, such as visual features, are often learned from videos, but we believe that more information from videos can be utilized as a stronger prior. We build a learning algorithm, VideoDex, that leverages visual, action, and physical priors from human video datasets to guide robot behavior. These action and physical priors in the neural network dictate the typical human behavior for a particular robot task. We test our approach on a robot arm and dexterous hand-based system and show strong results on various manipulation tasks, outperforming various state-of-the-art methods. Videos at https://video-dex.github.io
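
The physical prior named in the abstract comes from Neural Dynamic Policies, which the TL;DR credits with producing smooth trajectories. The sketch below illustrates the general idea with a one-dimensional dynamic movement primitive: a spring-damper system pulled toward a goal, plus a forcing term built from radial basis weights. It is a minimal illustration, not the authors' implementation. The function name `rollout_dmp`, the gain values, the basis-width heuristic, and the Euler integration are all assumptions chosen for readability.

```python
import numpy as np

def rollout_dmp(y0, goal, weights, n_steps=100, dt=0.01,
                alpha=25.0, beta=6.25, alpha_x=1.0):
    """Integrate a 1-D dynamic movement primitive with Euler steps.

    All default gains are illustrative assumptions, not values from the paper.
    Returns an array of n_steps + 1 positions, starting at y0.
    """
    weights = np.asarray(weights, dtype=float)
    n_basis = len(weights)
    # Basis centers placed along the decaying phase variable x.
    centers = np.exp(-alpha_x * np.linspace(0.0, 1.0, n_basis))
    # Width heuristic (an assumption): narrower bases where centers are small.
    widths = n_basis ** 1.5 / centers

    y, yd, x = float(y0), 0.0, 1.0
    traj = [y]
    for _ in range(n_steps):
        psi = np.exp(-widths * (x - centers) ** 2)
        # Forcing term: normalized weighted basis sum, scaled by phase and range.
        forcing = (psi @ weights) / (psi.sum() + 1e-10) * x * (goal - y0)
        # Second-order spring-damper dynamics pulled toward the goal.
        ydd = alpha * (beta * (goal - y) - yd) + forcing
        yd += ydd * dt
        y += yd * dt
        # Phase decays from 1 toward 0, fading the forcing term over time.
        x += -alpha_x * x * dt
        traj.append(y)
    return np.array(traj)
```

In an NDP-style policy, a network would predict quantities such as `weights` and `goal` from observations, and the rollout above would turn them into a trajectory. With all weights set to zero the system reduces to a critically damped spring that settles at the goal, which is one way to see why the resulting motion is smooth.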

Paper Structure (41 sections, 14 equations, 11 figures, 9 tables, 1 algorithm)

Figures (11)

  • Figure 1: We re-target human videos as an action prior, use pretrained embeddings as a visual prior, and use Neural Dynamic Policies (NDPs) [bahl2020neural] as a physical prior to complete many different tasks on a robotic hand.
  • Figure 2: The collection of train objects (left) and test objects (right) used for experimentation.
  • Figure 3: To use internet videos as pseudo-robot experience, we re-target human hand detections from the 3D MANO model [MANO:SIGGRAPHASIA:2017] to a 16-DoF robotic hand (LEAP) embodiment, and we re-target the wrist from the moving camera to the xArm6 [xarm] embodiment. Videos at https://video-dex.github.io
  • Figure 4: To use human videos as an action prior for training policies, we re-target them to the robot embodiment. The detected human fingers are converted to the robot fingers using a learned energy function. The wrist is re-targeted using the detections and camera trajectory and transformed to the robot arm.
  • Figure 5: Tasks used in experiments. From left to right: pick, rotate, open, cover, uncover, place and push. See https://video-dex.github.io for videos of these tasks.
  • ...and 6 more figures
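
The Figure 4 caption says the wrist is re-targeted using the hand detections and the camera trajectory, then transformed to the robot arm. One way to picture that chain is as a composition of homogeneous transforms, sketched below. This is an assumption-laden illustration rather than the paper's pipeline: the frame names (`world`, `cam`, `wrist`, `robot`), the helper `make_transform`, and the function `retarget_wrist` are hypothetical, and how each pose is estimated is outside the sketch.

```python
import numpy as np

def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = np.asarray(rotation, dtype=float)
    T[:3, 3] = np.asarray(translation, dtype=float)
    return T

def retarget_wrist(T_world_cam, T_cam_wrist, T_robot_world):
    """Express a wrist pose detected in the camera frame in the robot base frame.

    T_world_cam:   camera pose in a fixed world frame (from the camera trajectory)
    T_cam_wrist:   wrist pose in the camera frame (from the hand detection)
    T_robot_world: fixed world-to-robot-base transform (a hypothetical calibration)
    """
    return T_robot_world @ T_world_cam @ T_cam_wrist

# Example: camera rotated 90 degrees about z and shifted 1 m along x in the world.
Rz90 = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
T_world_cam = make_transform(Rz90, [1.0, 0.0, 0.0])
T_cam_wrist = make_transform(np.eye(3), [1.0, 0.0, 0.0])
T_robot_world = np.eye(4)  # assume the robot base coincides with the world frame

T_robot_wrist = retarget_wrist(T_world_cam, T_cam_wrist, T_robot_world)
wrist_position = T_robot_wrist[:3, 3]
```

Composing the camera pose with the per-frame detection removes the camera's own motion from the wrist trajectory, so the arm only has to follow the motion of the hand itself.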