General In-Hand Object Rotation with Vision and Touch
Haozhi Qi, Brent Yi, Sudharshan Suresh, Mike Lambeta, Yi Ma, Roberto Calandra, Jitendra Malik
TL;DR
RotateIt addresses general in-hand object rotation by integrating vision, touch, and proprioception in a two-stage framework: an oracle policy is first trained in simulation with privileged information $\bm{z}_t$ that encodes object shape and physical properties, and a visuotactile transformer then infers an estimate $\hat{\bm{z}}_t$ from sensory history so that the policy can be deployed in the real world. Encoding object shape with PointNet, together with multimodal sensing and temporal transformer modeling, yields significant gains over proprioception alone and approaches oracle performance for multi-axis rotation. Real-world experiments validate robust sim-to-real transfer across diverse objects and rotation axes, with a single multi-axis policy matching single-axis specialists. These results advance general-purpose dexterous in-hand manipulation and highlight the practical value of visuotactile sensing and privileged-information distillation in robotics.
Abstract
We introduce RotateIt, a system that enables fingertip-based object rotation along multiple axes by leveraging multimodal sensory inputs. Our system is trained in simulation, where it has access to ground-truth object shapes and physical properties. We then distill it to operate on realistic yet noisy simulated visuotactile and proprioceptive sensory inputs. These multimodal inputs are fused via a visuotactile transformer, enabling online inference of object shapes and physical properties during deployment. We show significant performance improvements over prior methods and demonstrate the importance of visual and tactile sensing.
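The two-stage idea (an oracle embedding $\bm{z}_t$ computed from privileged information, and a student that estimates $\hat{\bm{z}}_t$ from sensory history) can be illustrated with a deliberately small sketch. It is not the RotateIt implementation. The embedding is a scalar, a fixed gain stands in for the oracle encoder, a linear model stands in for the visuotactile transformer, and the student is fit with a mean-squared-error objective, which is an assumption because the text above does not state the training loss. All function names, sizes, and constants are hypothetical.

```python
import random

def make_rollouts(rng, n=200, hist_len=8, noise=0.3, oracle_gain=2.0):
    """Toy stand-in for simulated rollouts; every size and constant is hypothetical."""
    histories, targets = [], []
    for _ in range(n):
        p = rng.gauss(0.0, 1.0)   # privileged property, visible only in simulation
        z = oracle_gain * p       # oracle embedding z_t (a scalar in this sketch)
        # The student never sees p, only a noisy sensor trace of it over time.
        histories.append([p + rng.gauss(0.0, noise) for _ in range(hist_len)])
        targets.append(z)
    return histories, targets

def predict(w, b, hist):
    """Linear student (placeholder for the transformer): z_hat = w . hist + b."""
    return sum(wi * hi for wi, hi in zip(w, hist)) + b

def distill(histories, targets, lr=0.02, steps=300):
    """Regress z_hat onto z with full-batch gradient descent on the MSE (assumed loss)."""
    n, d = len(histories), len(histories[0])
    w, b, losses = [0.0] * d, 0.0, []
    for _ in range(steps):
        errs = [predict(w, b, h) - z for h, z in zip(histories, targets)]
        losses.append(sum(e * e for e in errs) / n)   # loss recorded before the update
        grad_w = [2.0 / n * sum(e * h[j] for e, h in zip(errs, histories))
                  for j in range(d)]
        grad_b = 2.0 / n * sum(errs)
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b, losses

if __name__ == "__main__":
    hists, zs = make_rollouts(random.Random(0))
    w, b, losses = distill(hists, zs)
    print(f"MSE before: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

At deployment time only `predict` would be needed, since the privileged property is unavailable outside simulation. In this toy setup the student recovers the embedding from the observation history alone.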
