Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids

Toru Lin, Kartik Sachdev, Linxi Fan, Jitendra Malik, Yuke Zhu

TL;DR

This work advances sim-to-real reinforcement learning for vision-based dexterous manipulation on humanoids by proposing a practical recipe that unites automated real-to-sim tuning, a generalizable contact- and object-goal reward, sample-efficient learning via task-aware initialization and divide-and-conquer distillation, and a hybrid perception strategy with domain randomization. The approach enables zero-shot transfer to unseen real objects and adapts across hardware variations, achieving robust performance on grasp-and-reach, box lift, and bimanual handover tasks. Key contributions include an autotune real-to-sim module, a keypoint-based reward design, and a distillation pipeline that bridges single-task expertise and a generalist policy, all validated through extensive real and simulated experiments. Collectively, the results demonstrate that vision-based dexterous manipulation via sim-to-real RL is viable, scalable, and broadly applicable to real-world humanoid manipulation tasks.

Abstract

Learning generalizable robot manipulation policies, especially for complex multi-fingered humanoids, remains a significant challenge. Existing approaches primarily rely on extensive data collection and imitation learning, which are expensive, labor-intensive, and difficult to scale. Sim-to-real reinforcement learning (RL) offers a promising alternative, but has mostly succeeded in simpler state-based or single-hand setups. How to effectively extend this to vision-based, contact-rich bimanual manipulation tasks remains an open question. In this paper, we introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three challenging dexterous manipulation tasks: grasp-and-reach, box lift, and bimanual handover. Our method features an automated real-to-sim tuning module, a generalized reward formulation based on contact and object goals, a divide-and-conquer policy distillation framework, and a hybrid object representation strategy with modality-specific augmentation. We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors, highlighting that vision-based dexterous manipulation via sim-to-real RL is not only viable, but also scalable and broadly applicable to real-world humanoid manipulation tasks.

Paper Structure

This paper contains 22 sections, 2 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview. We train a humanoid robot with two multi-fingered hands to perform a range of contact-rich dexterous manipulation tasks on diverse objects. Observations are obtained from a third-view camera, an egocentric camera, and robot proprioception. Our reinforcement learning policies generalize zero-shot to unseen real-world objects with varying physical properties (e.g. shape, size, color, material, mass) and remain robust against force disturbances. We also validate the adaptability of our approach on two hardware variations.
  • Figure 2: A sim-to-real RL recipe for vision-based dexterous manipulation. We close the environment modeling gap between simulation and reality through an automated real-to-sim tuning module, design generalizable task rewards by disentangling each manipulation task into contact states and object states, improve sample efficiency of policy training by using task-aware hand poses and divide-and-conquer distillation, and transfer vision-based policies to the real world with a mixture of sparse and dense object representations.
  • Figure 3: Policies learned in simulation. Left: grasp-and-reach; middle: box lift; right: bimanual handover (right-to-left, left-to-right).
  • Figure 4: Training grasp-and-reach policy with different object sets. Each curve is from 10 runs with different random seeds. Left: training with complex objects vs. simple geometric primitive objects. Right: training with differently grouped geometric objects.
  • Figure 5: Different contact patterns emerge from different placements of contact markers. Top: contact markers on the left and right side centers; middle: markers on the top and bottom side centers; bottom: markers on the bottom side edges.
  • ...and 2 more figures
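
The captions for Figures 2 and 5 describe rewards that split each task into contact states and object states, with contact markers placed on the object. A minimal sketch of how such a reward might be composed is given below. It is an illustration under stated assumptions, not the paper's formulation: the function names, the exponential distance shaping, the `scale` and weight values, and all coordinates are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two 3D points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contact_reward(fingertips, markers, scale=10.0):
    """Hypothetical contact term: each fingertip scores higher the closer
    it is to its nearest contact marker on the object."""
    total = 0.0
    for tip in fingertips:
        nearest = min(dist(tip, m) for m in markers)
        total += math.exp(-scale * nearest)
    return total / len(fingertips)


def object_goal_reward(object_keypoints, goal_keypoints, scale=10.0):
    """Hypothetical object-goal term: scores higher as the object's
    keypoints approach their goal positions (mean pairwise distance)."""
    pairs = list(zip(object_keypoints, goal_keypoints))
    mean_d = sum(dist(k, g) for k, g in pairs) / len(pairs)
    return math.exp(-scale * mean_d)


def task_reward(fingertips, markers, object_keypoints, goal_keypoints,
                w_contact=0.5, w_goal=0.5):
    """Weighted sum of the two terms; the weights are assumed, not sourced."""
    return (w_contact * contact_reward(fingertips, markers)
            + w_goal * object_goal_reward(object_keypoints, goal_keypoints))


# Markers on the left and right side centers of a 10 cm box (assumed layout).
markers = [(-0.05, 0.0, 0.0), (0.05, 0.0, 0.0)]
keypoints = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1)]

# Fingertips on the markers and object at its goal: both terms are maximal.
r_ideal = task_reward(markers, markers, keypoints, keypoints)

# Fingertips 20 cm above the markers, object 10 cm below its goal.
far_tips = [(-0.05, 0.0, 0.2), (0.05, 0.0, 0.2)]
goal = [(0.0, 0.0, 0.1), (0.0, 0.0, 0.2)]
r_far = task_reward(far_tips, markers, keypoints, goal)
```

In this sketch, moving the entries of `markers` (for example to the top and bottom side centers) changes which fingertip placements are rewarded, which is the kind of dependence on marker placement that the Figure 5 caption refers to.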