Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids
Toru Lin, Kartik Sachdev, Linxi Fan, Jitendra Malik, Yuke Zhu
TL;DR
This work advances sim-to-real reinforcement learning for vision-based dexterous manipulation on humanoids by proposing a practical recipe that unites automated real-to-sim tuning, a generalizable contact- and object-goal reward, sample-efficient learning via task-aware initialization and divide-and-conquer distillation, and a hybrid perception strategy with domain randomization. The approach enables zero-shot transfer to unseen real objects and adapts across hardware variations, achieving robust performance on grasp-and-reach, box lift, and bimanual handover tasks. Key contributions include an autotune real-to-sim module, a keypoint-based reward design, and a distillation pipeline that bridges single-task expertise and a generalist policy, all validated through extensive real and simulated experiments. Collectively, the results demonstrate that vision-based dexterous manipulation via sim-to-real RL is viable, scalable, and broadly applicable to real-world humanoid manipulation tasks.
Abstract
Learning generalizable robot manipulation policies, especially for complex multi-fingered humanoids, remains a significant challenge. Existing approaches rely primarily on extensive data collection and imitation learning, which are expensive, labor-intensive, and difficult to scale. Sim-to-real reinforcement learning (RL) offers a promising alternative, but has mostly succeeded in simpler state-based or single-hand setups. How to extend it effectively to vision-based, contact-rich bimanual manipulation tasks remains an open question. In this paper, we introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three challenging dexterous manipulation tasks: grasp-and-reach, box lift, and bimanual handover. Our method features an automated real-to-sim tuning module, a generalized reward formulation based on contact and object goals, a divide-and-conquer policy distillation framework, and a hybrid object representation strategy with modality-specific augmentation. We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors, showing that vision-based dexterous manipulation via sim-to-real RL is not only viable, but also scalable and broadly applicable to real-world humanoid manipulation tasks.
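The abstract names a reward built from contact goals and object goals but does not give its form here. As an illustration only, the following minimal sketch shows one way such a reward could be composed from keypoint distances. The function names, the exponential shaping, the `scale` parameter, and the equal weights are all assumptions made for this example, not the formulation used in the paper.

```python
import math

def contact_goal_reward(fingertips, contact_markers, scale=10.0):
    """Hypothetical contact term: each fingertip is rewarded for being
    close to its nearest contact marker on the object (3D points)."""
    total = 0.0
    for tip in fingertips:
        nearest = min(math.dist(tip, marker) for marker in contact_markers)
        total += math.exp(-scale * nearest)  # 1.0 at contact, decays with distance
    return total / len(fingertips)

def object_goal_reward(object_keypoints, goal_keypoints, scale=10.0):
    """Hypothetical object term: rewards object keypoints for matching
    their goal positions, using the mean keypoint error."""
    errors = [math.dist(p, g) for p, g in zip(object_keypoints, goal_keypoints)]
    mean_error = sum(errors) / len(errors)
    return math.exp(-scale * mean_error)

def total_reward(fingertips, contact_markers, object_keypoints, goal_keypoints,
                 w_contact=0.5, w_object=0.5):
    """Weighted sum of the two terms; the weights are illustrative assumptions."""
    return (w_contact * contact_goal_reward(fingertips, contact_markers)
            + w_object * object_goal_reward(object_keypoints, goal_keypoints))

# Usage: fingertips resting on the markers and the object at its goal.
tips = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
markers = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
obj = [(0.0, 0.0, 0.5), (0.1, 0.0, 0.5)]
goal = [(0.0, 0.0, 0.5), (0.1, 0.0, 0.5)]
best = total_reward(tips, markers, obj, goal)
```

Because both terms depend only on keypoint positions, the same two functions could in principle be reused across tasks by changing which markers and goal keypoints are supplied, which is the kind of generality the abstract attributes to the reward.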
