Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation

Sudharshan Suresh, Haozhi Qi, Tingfan Wu, Taosha Fan, Luis Pineda, Mike Lambeta, Jitendra Malik, Mrinal Kalakrishnan, Roberto Calandra, Michael Kaess, Joseph Ortiz, Mustafa Mukadam

TL;DR

NeuralFeels advances perception for in-hand manipulation by learning a neural SDF of an unknown object online while simultaneously optimizing the object's pose through a pose graph, fusing vision, tactile sensing, and proprioception. The modular frontend/backend design pairs pre-trained vision and tactile models with an online SLAM backend, enabling robust pose and shape estimation even under occlusion and depth noise. Across the simulated and real-world FeelSight experiments, the approach achieves a final reconstruction F-score of $0.81$ and an average pose drift of $4.7\,\mathrm{mm}$ (reduced to $2.3\,\mathrm{mm}$ with known CAD models), with up to $94\%$ improvement in tracking over vision-only baselines under heavy visual occlusion. This work provides a practical, interpretable perception backbone for dexterous manipulation and releases a public visuo-tactile benchmark for in-hand SLAM.

Abstract

To achieve human-level dexterity, robots must infer spatial awareness from multimodal sensing to reason over contact interactions. During in-hand manipulation of novel objects, such spatial awareness involves estimating the object's pose and shape. The status quo for in-hand perception primarily employs vision and is restricted to tracking a priori known objects. Moreover, visual occlusion of in-hand objects is inevitable during manipulation, preventing current systems from pushing beyond tasks without occlusion. We combine vision and touch sensing on a multi-fingered hand to estimate an object's pose and shape during in-hand manipulation. Our method, NeuralFeels, encodes object geometry by learning a neural field online and jointly tracks it by optimizing a pose graph problem. We study multimodal in-hand perception in simulation and the real world, interacting with different objects via a proprioception-driven policy. Our experiments show final reconstruction F-scores of $81\%$ and average pose drifts of $4.7\,\text{mm}$, further reduced to $2.3\,\text{mm}$ with known CAD models. Additionally, we observe that under heavy visual occlusion we can achieve up to $94\%$ improvements in tracking compared to vision-only methods. Our results demonstrate that touch, at the very least, refines and, at the very best, disambiguates visual estimates during in-hand manipulation. We release our evaluation dataset of 70 experiments, FeelSight, as a step towards benchmarking in this domain. Our neural representation driven by multimodal sensing can serve as a perception backbone towards advancing robot dexterity. Videos can be found on our project website https://suddhu.github.io/neural-feels/
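The reconstruction quality above is reported as an F-score. As a rough illustration of how such a score is commonly computed between a reconstructed and a ground-truth point cloud, here is a minimal sketch. The function names, the brute-force nearest-neighbour search, the toy points, and the distance threshold `tau` are all assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
import math


def nearest_distance(point, cloud):
    """Distance from `point` to its nearest neighbour in `cloud` (brute force)."""
    return min(math.dist(point, other) for other in cloud)


def f_score(reconstructed, ground_truth, tau):
    """Harmonic mean of precision and recall at distance threshold `tau`.

    precision: fraction of reconstructed points within `tau` of the ground truth
    recall:    fraction of ground-truth points within `tau` of the reconstruction
    """
    if not reconstructed or not ground_truth:
        return 0.0
    precision = sum(
        nearest_distance(p, ground_truth) <= tau for p in reconstructed
    ) / len(reconstructed)
    recall = sum(
        nearest_distance(q, reconstructed) <= tau for q in ground_truth
    ) / len(ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy point clouds in metres (hypothetical data, not from FeelSight).
ground_truth = [(0.00, 0.0, 0.0), (0.01, 0.0, 0.0), (0.02, 0.0, 0.0), (0.03, 0.0, 0.0)]
reconstructed = [(0.00, 0.001, 0.0), (0.01, 0.001, 0.0), (0.05, 0.0, 0.0)]

# Assumed threshold of 5 mm for this example.
score = f_score(reconstructed, ground_truth, tau=0.005)
print(f"F-score: {score:.3f}")
```

In this toy case two of the three reconstructed points lie near the ground truth (precision 2/3) and two of the four ground-truth points are covered (recall 1/2), so the score penalizes both the spurious point and the missing surface.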

Paper Structure (28 sections, 2 equations, 20 figures)

Figures (20)

  • Figure 1: Visuo-tactile perception with NeuralFeels. Our method estimates pose and shape of novel objects (right) during in-hand manipulation, by learning neural field models online from a stream of vision, touch, and proprioception (left).
  • Figure 2: A visuo-tactile perception stack amidst interaction. An online representation of object shape and pose is built from vision, touch, and proprioception during in-hand manipulation. Raw sensor data is first fed into the frontend, which extracts visuo-tactile depth with our pre-trained models. Following this, the backend samples from the depth to train a neural signed distance field (SDF), while the pose graph tracks the posed neural field.
  • Figure 3: Summary of SLAM experiments. (a, b) We present aggregated statistics for SLAM over a combined 70 experiments (40 in simulation and 30 in the real world), with each trial run over 5 different seeds. We compare across simulation and the real world to show low pose drift and high reconstruction accuracy. (c) Table 1 reports the number of trials in which our method fails to track (and reconstruct) the object. (d) Representative examples of the final object pose and neural field renderings from the experiments. (e) The final 3D objects generated by marching cubes on our neural field. Here, we highlight the role touch plays in both shape completion and shape refinement.
  • Figure 4: Representative SLAM results. In both the real world and simulation, we build an evolving neural SDF that integrates vision and touch while simultaneously tracking the object. We illustrate the input stream of RGB-D and tactile images, paired with the posed reconstruction at that timestep.
  • Figure 5: Neural pose tracking of known objects. (a) With known ground-truth shape, we can robustly track objects such as the Rubik's cube and potted meat can. (b) We observe reliable tracking performance, with average pose errors of $2\,\text{mm}$ through the sequence. (c) With a known object model and good visibility, touch plays the role of pose refinement.
  • ...and 15 more figures
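
The Figure 2 caption describes a backend in which a pose graph tracks the posed neural field. The core idea, that observed surface points should sit on the zero level set of the SDF once the object pose is correct, can be sketched in a few lines. This is a toy under loud assumptions: an analytic sphere SDF stands in for the learned neural SDF, only translation is estimated (not a full 6-DoF pose), and plain gradient descent replaces the pose graph optimizer. All names and values are hypothetical.

```python
import math

RADIUS = 0.05  # assumed sphere radius in metres (stand-in for a learned shape)


def sphere_sdf(x):
    """Signed distance from point x to a sphere of RADIUS centred at the origin."""
    return math.sqrt(sum(c * c for c in x)) - RADIUS


def track_translation(surface_points, t_init, lr=1.0, iterations=200):
    """Estimate the object translation t by minimising mean(sdf(p - t)^2).

    Points measured on the object's surface should have zero signed distance
    once they are expressed in the object frame, i.e. sdf(p - t) = 0.
    """
    t = list(t_init)
    for _ in range(iterations):
        grad = [0.0, 0.0, 0.0]
        for p in surface_points:
            local = [p[k] - t[k] for k in range(3)]  # point in object frame
            norm = math.sqrt(sum(c * c for c in local))
            if norm == 0.0:
                continue  # SDF gradient is undefined at the sphere centre
            d = norm - RADIUS
            # d(sdf(p - t))/dt = -(p - t)/|p - t|, so d(d^2)/dt = -2 d (p - t)/|p - t|
            for k in range(3):
                grad[k] += -2.0 * d * local[k] / norm
        t = [t[k] - lr * grad[k] / len(surface_points) for k in range(3)]
    return t


# Hypothetical ground-truth translation and six surface samples around it.
t_true = (0.01, -0.02, 0.015)
surface_points = []
for axis in range(3):
    for sign in (-1.0, 1.0):
        p = list(t_true)
        p[axis] += sign * RADIUS
        surface_points.append(tuple(p))

t_est = track_translation(surface_points, t_init=(0.0, 0.0, 0.0))
residual = max(abs(sphere_sdf([p[k] - t_est[k] for k in range(3)])) for p in surface_points)
print("estimated translation:", [round(c, 4) for c in t_est])
```

In this sketch, sparse samples such as tactile contacts and dense samples such as camera depth would enter the same objective simply as additional surface points.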