ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos
Zerui Chen, Shizhe Chen, Etienne Arlaud, Ivan Laptev, Cordelia Schmid
TL;DR
ViViDex is a three-stage framework for learning vision-based dexterous manipulation from human videos: (1) extract reference trajectories from the videos, (2) refine them with trajectory-guided RL to train a state-based policy for each video, and (3) distill successful rollouts into a unified visual policy that uses no privileged object information. It enhances the 3D point cloud representation via coordinate transformations and compares behavior cloning (BC) and diffusion policy training for the visual policy. Across three dexterous manipulation tasks, it outperforms state-of-the-art approaches in simulation and on a real robot, generalizes to unseen objects, and is data-efficient, requiring only a few human videos. The approach reduces reward engineering and removes the reliance on ground-truth object states, offering a scalable path toward robust vision-based dexterous control.
Abstract
In this work, we aim to learn a unified vision-based policy for multi-fingered robot hands to manipulate a variety of objects in diverse poses. Though prior work has shown the benefits of using human videos for policy learning, performance gains have been limited by the noise in estimated trajectories. Moreover, reliance on privileged object information such as ground-truth object states further limits applicability in realistic scenarios. To address these limitations, we propose a new framework, ViViDex, to improve vision-based policy learning from human videos. It first uses reinforcement learning with trajectory-guided rewards to train a state-based policy for each video, obtaining trajectories that are both visually natural and physically plausible. We then roll out successful episodes from the state-based policies and train a unified visual policy without using any privileged information. We propose a coordinate transformation to further enhance the visual point cloud representation, and compare behavior cloning and diffusion policy for training the visual policy. Experiments both in simulation and on a real robot demonstrate that ViViDex outperforms state-of-the-art approaches on three dexterous manipulation tasks.
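
To make the idea of a coordinate transformation on a point cloud concrete, the following is a minimal sketch of a generic rigid change of frame, not the paper's implementation. The abstract does not specify which frames ViViDex uses, so the function name `to_local_frame`, the choice of a local frame given by `rotation` and `origin`, and the example values are all assumptions made for illustration.

```python
from typing import List, Sequence


def to_local_frame(
    points: Sequence[Sequence[float]],
    rotation: Sequence[Sequence[float]],
    origin: Sequence[float],
) -> List[List[float]]:
    """Express world-frame 3D points in a local frame (hypothetical helper).

    rotation: 3x3 matrix whose columns are the local frame's axes,
              written in world coordinates.
    origin:   the local frame's origin, in world coordinates.

    Computes p_local = R^T (p_world - origin) for every point.
    """
    local_points = []
    for p in points:
        # Translate so the local origin sits at zero.
        d = [p[i] - origin[i] for i in range(3)]
        # Multiply by R^T: component j is the dot product of d with column j of R.
        local_points.append(
            [sum(rotation[i][j] * d[i] for i in range(3)) for j in range(3)]
        )
    return local_points


# Assumed example: a local frame rotated 90 degrees about the world z-axis,
# with its origin at (1, 0, 0) in world coordinates.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
cloud_world = [[1.0, 2.0, 0.0]]
cloud_local = to_local_frame(cloud_world, R, origin=[1.0, 0.0, 0.0])
```

Expressing the same points relative to a task-relevant frame leaves the geometry unchanged while removing the dependence on where that frame sits in the world, which is one plausible reason such a transformation could help a visual policy.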
