From Simple to Complex Skills: The Case of In-Hand Object Reorientation
Haozhi Qi, Brent Yi, Mike Lambeta, Yi Ma, Roberto Calandra, Jitendra Malik
TL;DR
The paper tackles sim-to-real transfer for in-hand object reorientation by introducing a hierarchical policy that reuses pre-trained low-level rotation skills and a transformer-based, generalizable state estimator. A planner outputs rotation axes and residual actions to complement the low-level skills, while the estimator predicts relative pose using proprioception and skill feedback, enabling robust transfer to real hardware. Across simulation and real-world tests, the approach achieves faster training (up to 8× faster convergence), greater robustness to out-of-distribution perturbations, and successful manipulation of diverse objects, including symmetric and textureless ones. The work reduces manual reward engineering, demonstrates strong sim-to-real performance, and points to future integration of tactile sensing to handle slipping and improve long-term pose tracking.
Abstract
Learning policies in simulation and transferring them to the real world has become a promising approach in dexterous manipulation. However, bridging the sim-to-real gap for each new task requires substantial human effort, such as careful reward engineering, hyperparameter tuning, and system identification. In this work, we present a system that leverages low-level skills to address these challenges for more complex tasks. Specifically, we introduce a hierarchical policy for in-hand object reorientation based on previously acquired rotation skills. This hierarchical policy learns to select which low-level skill to execute based on feedback from both the environment and the low-level skill policies themselves. Compared to learning from scratch, the hierarchical policy is more robust to out-of-distribution changes and transfers easily from simulation to real-world environments. Additionally, we propose a generalizable object pose estimator that uses proprioceptive information, low-level skill predictions, and control errors as inputs to estimate the object pose over time. We demonstrate that our system can reorient objects, including symmetrical and textureless ones, to a desired pose.
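To make the hierarchical structure concrete, the following is a minimal sketch of one control step, not the authors' implementation. It assumes a discrete set of six candidate rotation axes, a 16-joint hand, and a rotation error given as an axis-angle vector; all names (`AXES`, `NUM_JOINTS`, `low_level_skill`, `planner`, `hierarchical_step`, `residual_scale`) are hypothetical. The low-level skill is a fixed random linear map standing in for a frozen pre-trained policy, and the planner is a hand-written heuristic standing in for the learned one.

```python
import numpy as np

# Assumed discrete set of rotation axes the planner can choose from.
AXES = np.array([
    [1, 0, 0], [-1, 0, 0],
    [0, 1, 0], [0, -1, 0],
    [0, 0, 1], [0, 0, -1],
], dtype=float)

NUM_JOINTS = 16  # assumed joint count for a multi-fingered hand

# Fixed random weights: a stand-in for a frozen, pre-trained rotation skill.
_rng = np.random.default_rng(0)
_W = _rng.normal(scale=0.1, size=(NUM_JOINTS, NUM_JOINTS + 3))


def low_level_skill(proprio, axis):
    """Placeholder skill: maps joint state and a rotation axis to joint targets."""
    return np.tanh(_W @ np.concatenate([proprio, axis]))


def planner(rot_error, prev_skill_action):
    """Heuristic stand-in for the learned planner.

    Picks the candidate axis best aligned with the remaining rotation error.
    A learned planner would also use prev_skill_action (skill feedback) and
    predict a non-zero residual; this sketch ignores both.
    """
    axis_idx = int(np.argmax(AXES @ rot_error))
    residual = np.zeros(NUM_JOINTS)
    return axis_idx, residual


def hierarchical_step(proprio, rot_error, prev_skill_action, residual_scale=0.1):
    """One step: choose an axis, run the skill, add the scaled residual."""
    axis_idx, residual = planner(rot_error, prev_skill_action)
    skill_action = low_level_skill(proprio, AXES[axis_idx])
    action = np.clip(skill_action + residual_scale * residual, -1.0, 1.0)
    return axis_idx, skill_action, action


proprio = np.zeros(NUM_JOINTS)
rot_error = np.array([0.0, 0.0, 0.5])  # remaining rotation is about +z
axis_idx, skill_action, action = hierarchical_step(
    proprio, rot_error, prev_skill_action=np.zeros(NUM_JOINTS)
)
```

In this sketch the residual is added on top of the skill output rather than replacing it, so the planner only needs to correct the pre-trained skill instead of producing full joint commands itself.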
