General In-Hand Object Rotation with Vision and Touch

Haozhi Qi, Brent Yi, Sudharshan Suresh, Mike Lambeta, Yi Ma, Roberto Calandra, Jitendra Malik

TL;DR

RotateIt addresses general in-hand object rotation by integrating vision, touch, and proprioception through a two-stage framework: an oracle policy trained in simulation with privileged information $\bm{z}_t$ encoding object shape and physics, and a visuotactile transformer that infers $\hat{\bm{z}}_t$ from history to deploy in the real world. The approach demonstrates that including object shape through PointNet, plus multimodal sensing and temporal Transformer modeling, yields significant gains over proprioception alone and approaches oracle performance for multi-axis rotation. Real-world experiments validate robust sim-to-real transfer across diverse objects and axes, with a single multi-axis policy matching single-axis specialists. These results advance general-purpose, dexterous in-hand manipulation and highlight the practical impact of visuotactile sensing and privileged-information distillation in robotics.

Abstract

We introduce RotateIt, a system that enables fingertip-based object rotation along multiple axes by leveraging multimodal sensory inputs. Our system is trained in simulation, where it has access to ground-truth object shapes and physical properties. Then we distill it to operate on realistic yet noisy simulated visuotactile and proprioceptive sensory inputs. These multimodal inputs are fused via a visuotactile transformer, enabling online inference of object shapes and physical properties during deployment. We show significant performance improvements over prior methods and the importance of visual and tactile sensing.
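The distillation step can be sketched as a regression from the inferred latent $\hat{\bm{z}}_t$ to the privileged latent $\bm{z}_t$. The snippet below is a minimal illustration, not the authors' implementation: the source only says "regression loss", so the choice of mean squared error, the function name `latent_regression_loss`, and the plain-list representation of the latents are all assumptions.

```python
def latent_regression_loss(z_privileged, z_inferred):
    """Mean squared error between a privileged latent z_t and an inferred latent z_hat_t.

    Assumption: the source only mentions a "regression loss"; MSE is used here
    purely for illustration. Both latents are plain sequences of floats.
    """
    if len(z_privileged) != len(z_inferred):
        raise ValueError("latents must have the same dimensionality")
    if len(z_privileged) == 0:
        raise ValueError("latents must be non-empty")
    squared_errors = [(a - b) ** 2 for a, b in zip(z_privileged, z_inferred)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical latents: z_t from the oracle's privileged encoder,
# z_hat_t from a sensor-history model.
z_t = [0.0, 0.0]
z_hat_t = [1.0, 3.0]
loss = latent_regression_loss(z_t, z_hat_t)
```

In a real training loop this quantity would be averaged over a batch of timesteps and minimized with respect to the parameters of the model that produces $\hat{\bm{z}}_t$, while the oracle's latent serves as a fixed target.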

Paper Structure (16 sections, 1 equation, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Rotation over multiple axes by integrating proprioception, vision, and touch sensing. RotateIt is trained in simulation and deployed directly to the real world, where it generalizes to diverse test objects without the need for fine-tuning. Please see our website, https://haozhi.io/rotateit/, for more videos.
  • Figure 2: An overview of our training pipeline. Trainable components are highlighted in green. In oracle policy training, we jointly optimize the privileged encoder and control policy using PPO. In the visuotactile policy training, we feed a sequence of visuotactile and proprioceptive inputs to a transformer to infer $\hat{\bm{z}}_t$. The visuotactile transformer is trained by minimizing the regression loss between $\bm{z}_t$ and $\hat{\bm{z}}_t$.
  • Figure 3: Training objects. We curated a diverse combination of objects from EGAD [morrison2020egad], Google Scanned Objects [downs2022google], YCB [calli2015ycb], and ContactDB [brahmbhatt2019contactdb]. We filter out meshes with disconnected components and objects with a width/depth/height (w/d/h) ratio larger than 2.0.
  • Figure 4: Representation for Sim-to-Real Touch Sensing. In simulation, we use the discretized contact location provided by the simulator. In the real world, we detect the deformation by tracking colored regions of the sensor outputs, and parse the same information from a temporal stream of tactile images.
  • Figure 5: Representation for Sim-to-Real Vision Sensing. In simulation, we use the object's foreground depth as the input. In the real world, to reduce the sim-to-real gap, we segment out the object's depth map using Segment-Anything.
  • ...and 5 more figures
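
The object filtering rule in the Figure 3 caption can be sketched as a bounding-box check. This is a hedged illustration under stated assumptions: the caption does not say how the w/d/h ratio is computed, so the snippet assumes it means the longest axis-aligned bounding-box extent divided by the shortest. The helper names `bounding_box_extents` and `passes_ratio_filter` are hypothetical, and the disconnected-component check mentioned in the caption is not implemented here.

```python
def bounding_box_extents(vertices):
    """Return the (width, depth, height) of the axis-aligned bounding box.

    `vertices` is a sequence of (x, y, z) tuples.
    """
    xs, ys, zs = zip(*vertices)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))


def passes_ratio_filter(vertices, max_ratio=2.0):
    """Keep a mesh only if longest / shortest extent does not exceed max_ratio.

    Assumption: "w/d/h ratio larger than 2.0" is read as the ratio between the
    largest and smallest bounding-box extents. Degenerate (flat) meshes are rejected.
    """
    extents = bounding_box_extents(vertices)
    shortest = min(extents)
    if shortest <= 0:
        return False
    return max(extents) / shortest <= max_ratio


# Hypothetical meshes given only by two opposite bounding-box corners.
cube_like = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
elongated = [(0.0, 0.0, 0.0), (3.0, 1.0, 1.0)]
keep_cube = passes_ratio_filter(cube_like)
keep_elongated = passes_ratio_filter(elongated)
```

Under this reading, a cube-like object is kept and an object three times longer than it is wide is dropped. An object with a ratio of exactly 2.0 is kept, since the caption removes only ratios larger than 2.0.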