ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos

Zerui Chen, Shizhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid

TL;DR

ViViDex is a three-stage framework for learning vision-based dexterous manipulation from human videos: (1) extract reference trajectories from the videos, (2) refine them with trajectory-guided RL to train a state-based policy for each video, and (3) distill successful rollouts into a unified visual policy that uses no privileged object information. It enhances the 3D point cloud representation via coordinate transformation and compares behavior cloning (BC) with diffusion policy for training the visual policy. On three dexterous manipulation tasks, it outperforms state-of-the-art approaches in simulation and on a real robot, generalizes to unseen objects, and is data-efficient, requiring only a few human videos. The approach reduces reward engineering, removes the reliance on ground-truth object states, and offers a scalable path toward robust vision-based dexterous control.

Abstract

In this work, we aim to learn a unified vision-based policy for multi-fingered robot hands to manipulate a variety of objects in diverse poses. Though prior work has shown benefits of using human videos for policy learning, performance gains have been limited by the noise in estimated trajectories. Moreover, reliance on privileged object information such as ground-truth object states further limits the applicability in realistic scenarios. To address these limitations, we propose a new framework, ViViDex, to improve vision-based policy learning from human videos. It first uses reinforcement learning with trajectory-guided rewards to train state-based policies for each video, obtaining both visually natural and physically plausible trajectories from the video. We then roll out successful episodes from the state-based policies and train a unified visual policy without using any privileged information. We propose a coordinate transformation to further enhance the visual point cloud representation, and compare behavior cloning and diffusion policy for the visual policy training. Experiments both in simulation and on the real robot demonstrate that ViViDex outperforms state-of-the-art approaches on three dexterous manipulation tasks.
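
The abstract mentions a coordinate transformation of the point cloud but does not spell it out here. As a minimal sketch, assuming the idea is to re-express world-frame points in a frame attached to the hand, the operation could look like the following. The function names, the yaw-only rotation, and the choice of a hand frame are illustrative assumptions, not the authors' implementation.

```python
import math

def make_transform(yaw, translation):
    """Build a 4x4 homogeneous transform: rotation about z by `yaw`, plus a translation.
    (Hypothetical helper; a real hand pose would use a full 3D rotation.)"""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def invert_rigid(T):
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def transform_points(T, points):
    """Apply the 4x4 transform T to every (x, y, z) point."""
    out = []
    for x, y, z in points:
        p = (x, y, z, 1.0)
        out.append(tuple(sum(T[i][k] * p[k] for k in range(4)) for i in range(3)))
    return out

def to_hand_frame(points_world, T_world_hand):
    """Re-express world-frame points in the hand frame, given the hand pose in the world."""
    return transform_points(invert_rigid(T_world_hand), points_world)

# Assumed hand pose: rotated 90 degrees about z, located at (1, 2, 3) in the world.
T_world_hand = make_transform(math.pi / 2, (1.0, 2.0, 3.0))
cloud_world = [(1.0, 2.0, 3.0), (1.0, 3.0, 3.0)]
cloud_hand = to_hand_frame(cloud_world, T_world_hand)
```

In this sketch, a point that coincides with the hand origin maps to the origin of the hand frame, so the representation no longer depends on where the hand sits in the world.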

Paper Structure

This paper contains 17 sections, 3 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The overall framework of our method for learning dexterous manipulation skills from human videos. It consists of three steps: extraction of reference trajectories from human videos, trajectory-guided state-based policy learning using RL, and vision-based policy learning using either behavior cloning or the 3D diffusion policy.
  • Figure 2: Motion retargeting results for the Allegro hand and objects under different poses for selected DexYCB videos.
  • Figure 3: Qualitative comparison of state-based policies using different rewards for Protocol #1 and the Allegro hand. R1 (w/o hand reward in pre-grasp) leads to unstable grasps. R2 (w/o hand reward in manipulation) results in unnatural hand actions. Our proposed approach R3 uses hand rewards at both stages and achieves the best performance.
  • Figure 4: Illustrations of our real-world robot experimental setup and the performance of our proposed ViViDex algorithm.
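
The Figure 3 caption distinguishes three reward variants by the stage in which the hand reward is active. The sketch below encodes only that on/off structure. The exponential tracking term, its scale, the error inputs, and the decision to reward object tracking only during manipulation are assumptions made for illustration; the caption does not give the actual reward formulas.

```python
import math

STAGES = ("pre_grasp", "manipulation")

def tracking_term(error, scale=10.0):
    """Hypothetical tracking reward: 1.0 at zero error, decaying as the error grows."""
    return math.exp(-scale * error)

def hand_reward_active(variant, stage):
    """Which variants use the hand reward in which stage, following the Figure 3 caption."""
    if variant == "R1":  # no hand reward in pre-grasp
        return stage == "manipulation"
    if variant == "R2":  # no hand reward in manipulation
        return stage == "pre_grasp"
    if variant == "R3":  # hand reward in both stages
        return True
    raise ValueError(f"unknown variant: {variant}")

def trajectory_reward(stage, hand_error, object_error, variant="R3"):
    """Sum the active tracking terms for one step (illustrative form only)."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    reward = 0.0
    if hand_reward_active(variant, stage):
        reward += tracking_term(hand_error)
    if stage == "manipulation":
        # Assumption: the object trajectory is only tracked once manipulation starts.
        reward += tracking_term(object_error)
    return reward

# With perfect tracking, R1 gives no signal before the grasp, while R3 does.
r1_pre = trajectory_reward("pre_grasp", 0.0, 0.0, variant="R1")
r3_pre = trajectory_reward("pre_grasp", 0.0, 0.0, variant="R3")
```

Under these assumptions, R1 leaves the pre-grasp stage without any hand guidance and R2 leaves the manipulation stage with object tracking only, which matches the failure modes the caption attributes to each variant.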