Bridging the Human to Robot Dexterity Gap through Object-Oriented Rewards

Irmak Guzey, Yinlong Dai, Georgy Savva, Raunaq Bhirangi, Lerrel Pinto

TL;DR

HuDOR tackles the human-to-robot dexterity gap by deriving object-centric, trajectory-matching rewards from a single in-scene human video and applying online residual RL to a four-fingered robot hand. The approach combines a VR-based data capture pipeline, pose transfer via inverse kinematics, and an object-point-tracking reward powered by language-grounded object masks and Co-Tracker trajectories. Empirical results across four tasks show that HuDOR achieves substantial improvements over offline baselines and highlight the importance of online corrections for high-precision manipulation, with varying generalization to new objects and larger workspaces. The method enables online, teleoperation-free learning of dexterous policies, opening practical pathways for adapting human demonstrations to diverse robot morphologies.

Abstract

Training robots directly from human videos is an emerging area in robotics and computer vision. While there has been notable progress with two-fingered grippers, learning autonomous tasks for multi-fingered robot hands in this way remains challenging. A key reason for this difficulty is that a policy trained on human hands may not directly transfer to a robot hand due to morphology differences. In this work, we present HuDOR, a technique that enables online fine-tuning of policies by directly computing rewards from human videos. Importantly, this reward function is built using object-oriented trajectories derived from off-the-shelf point trackers, providing meaningful learning signals despite the morphology gap and visual differences between human and robot hands. Given a single video of a human solving a task, such as gently opening a music box, HuDOR enables our four-fingered Allegro hand to learn the task with just an hour of online interaction. Our experiments across four tasks show that HuDOR achieves a 4x improvement over baselines. Code and videos are available on our website, https://object-rewards.github.io.

Paper Structure

This paper contains 24 sections, 2 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: HuDOR generates rewards from human videos by tracking points on the manipulable object, indicated by the rainbow-colored dots, over the trajectory. This allows for online training of multi-fingered robot hands given only a single video of a human solving the task (left) without any robot teleoperation. To optimize the robot's policy (middle), rewards are computed by matching the point movements of the robot policy $\mathcal{T}_R$ with those in the human video $\mathcal{T}_H$. In under an hour of online fine-tuning, our Allegro robot hand (right) is able to open the music box.
  • Figure 2: Illustration of the robot setup and trajectory transfer in HuDOR. ArUco markers are used for calibration. The human demonstration is collected in-scene, i.e., the demonstrator is in the same scene as the robot. The VR headset is used solely for obtaining the fingertip positions with respect to the robot frame (illustrated with colored dots) and can be worn or attached to the setup as needed. World frame $W$ is visualized on the ArUco marker on the operation table.
  • Figure 3: An illustration of the human demonstrations (top rows) and the corresponding robot policies (bottom rows) trained using HuDOR. Our method does not require teleoperated robot data and learns to imitate human demonstrations through online interactions. Note the differences in hand motions between the learned robot policy and the human videos, reflecting the morphological differences.
  • Figure 4: An illustration showing how masked objects appear in both the robot and human videos. Points on the objects represent tracking in each video. Occlusions are indicated by hollow points rather than solid ones.
  • Figure 5: Rollouts of trained policies from HuDOR on four tasks are shown. For all tasks, validation is performed at various locations within the illustrated areas in the leftmost frames, while training is conducted using a single human video where the initial object configuration is in the middle of these areas. Success for each task is shown in the rightmost frames. Videos are best viewed on our website: https://object-rewards.github.io/.
  • ...and 4 more figures
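
The Figure 1 caption describes rewards computed by matching the point movements of the robot policy $\mathcal{T}_R$ with those of the human video $\mathcal{T}_H$. The sketch below illustrates one way such a trajectory-matching reward could look; it is not the authors' implementation. It assumes that each trajectory is summarized by the displacement of the tracked points' centroid relative to the first frame, that the reward is the negative root-mean-square distance between the two trajectories, and that both videos have the same number of frames with no occluded points. The names `centroid`, `object_trajectory`, and `trajectory_reward` are hypothetical.

```python
from math import sqrt

Point = tuple[float, float]
Frame = list[Point]   # tracked 2D points in one video frame
Track = list[Frame]   # frames over time, same point order in every frame


def centroid(frame: Frame) -> Point:
    """Mean position of the tracked points in a single frame."""
    n = len(frame)
    return (sum(p[0] for p in frame) / n, sum(p[1] for p in frame) / n)


def object_trajectory(track: Track) -> list[Point]:
    """Centroid displacement per frame, relative to the first frame.

    Assumption: this displacement is a stand-in for the object-oriented
    trajectory; the paper's exact definition may differ.
    """
    x0, y0 = centroid(track[0])
    return [(cx - x0, cy - y0) for cx, cy in map(centroid, track)]


def trajectory_reward(robot_track: Track, human_track: Track) -> float:
    """Negative RMS distance between the two object trajectories.

    A value of 0 means the object moved identically in both videos;
    more negative values mean a larger mismatch.
    """
    traj_r = object_trajectory(robot_track)
    traj_h = object_trajectory(human_track)
    if len(traj_r) != len(traj_h):
        raise ValueError("trajectories must have the same number of frames")
    squared = [
        (rx - hx) ** 2 + (ry - hy) ** 2
        for (rx, ry), (hx, hy) in zip(traj_r, traj_h)
    ]
    return -sqrt(sum(squared) / len(squared))
```

Because the reward depends only on how the object's points move, not on how the hand looks or moves, it stays meaningful when the human and robot hands differ in shape, which is the property the caption highlights.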