Monocular Reconstruction of Neural Tactile Fields

Pavan Mantripragada, Siddhanth Deshmukh, Eadom Dessalene, Manas Desai, Yiannis Aloimonos

TL;DR

This work addresses the need for interaction-aware 3D scene representations by introducing neural tactile fields, a dense 3D map from location to predicted tactile response that can be inferred from a single monocular RGB image. The method extends a Large Reconstruction Model (LRM) with a finetuned triplane decoder to jointly predict geometry and a 3D tactile field S(x), supervised by a new high-resolution visuotactile dataset acquired with GelSight pressure measurements. The authors demonstrate improved volumetric and surface reconstruction over state-of-the-art monocular methods and show that predicted tactile fields enable interaction-aware planning in which paths navigate deformable regions while avoiding rigid obstacles. This approach provides a practical pathway to safety and efficiency in planning under contact-rich conditions, such as agricultural robotics, by forecasting how objects will resist or yield to contact without online physical interaction. The work also contributes a dataset and a scalable training framework that can initialize more advanced physically grounded reconstructions.

Abstract

Robots operating in the real world must plan through environments that deform, yield, and reconfigure under contact, requiring interaction-aware 3D representations that extend beyond static geometric occupancy. To address this, we introduce neural tactile fields, a novel 3D representation that maps spatial locations to the expected tactile response upon contact. Our model predicts these neural tactile fields from a single monocular RGB image -- the first method to do so. When integrated with off-the-shelf path planners, neural tactile fields enable robots to generate paths that avoid high-resistance objects while deliberately routing through low-resistance regions (e.g. foliage), rather than treating all occupied space as equally impassable. Empirically, our learning framework improves volumetric 3D reconstruction by $85.8\%$ and surface reconstruction by $26.7\%$ compared to state-of-the-art monocular 3D reconstruction methods (LRM and Direct3D).
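The planning behaviour described in the abstract can be illustrated with a minimal sketch. It is not the authors' planner: it assumes the tactile field has been discretized into a 2D grid of hypothetical normalized resistance values in [0, 1], and it uses plain Dijkstra search as a stand-in for an off-the-shelf planner. The names `plan_path`, `block_above`, and `weight` are illustrative assumptions.

```python
import heapq


def plan_path(resistance, start, goal, block_above=0.8, weight=5.0):
    """Dijkstra search on a 2D grid whose step cost grows with resistance.

    resistance: list of rows holding hypothetical normalized tactile
        responses in [0, 1] (an assumption, not the paper's units).
    block_above: cells at or above this value are treated as rigid obstacles.
    weight: extra cost per unit of resistance for traversable cells.
    Returns a list of (row, col) cells, or None if no path exists.
    """
    rows, cols = len(resistance), len(resistance[0])
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist.get(cell, float("inf")):
            continue  # stale heap entry
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            s = resistance[nr][nc]
            if s >= block_above:
                continue  # high resistance: impassable
            nd = d + 1.0 + weight * s  # low resistance: passable at a cost
            if nd < dist.get((nr, nc), float("inf")):
                dist[(nr, nc)] = nd
                prev[(nr, nc)] = cell
                heapq.heappush(heap, (nd, (nr, nc)))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# A wall of two rigid cells (1.0) with a compliant gap (0.2), e.g. foliage.
grid = [
    [0.0, 0.0, 0.0],
    [1.0, 0.2, 1.0],
    [0.0, 0.0, 0.0],
]
path = plan_path(grid, (0, 0), (2, 0))
```

With the default settings the path passes through the compliant middle cell. Lowering `block_above` to 0.1 mimics a binary occupancy map in which every non-empty cell is an obstacle, and the same query then has no solution.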

Paper Structure (24 sections, 19 equations, 8 figures, 2 tables)


Figures (8)

  • Figure 1: Our tactile field prediction pipeline. A monocular RGB image is processed through a Large Reconstruction Model (LRM) architecture with a frozen vision backbone and a finetuned triplane decoder. The network outputs a continuous 3D tactile field prediction, which is supervised against the ground-truth neural tactile field constructed from physical interaction data (see the dataset collection section). Our model’s predictions encode the joint influence of object weight, compliance, and stability based solely on visual appearance.
  • Figure 2: Ground-truth neural tactile field construction (dataset collection stage). Multi-view posed RGB images are used to fit a NeRF reconstruction of the object. A pose-tracked GelSight Mini sensor (shown contacting the plant) records pressure measurements at multiple surface locations. Each GelSight image is converted to a dense 2D pressure map over the sensor skin. These pressure measurements are then spatially aligned with the NeRF coordinate frame using the tracked sensor pose ("Align Pressures"). Finally, the aligned pressure points are projected into the 3D NeRF volume ("Project Pressures"), where each voxel aggregates nearby pressure values to produce a continuous 3D neural tactile field that serves as ground-truth supervision for training our prediction network.
  • Figure 3: The 40 objects used for training and evaluation, spanning a range of weights, stiffnesses, and materials. Plant pots were rigidly mounted during data collection, while all other objects were free to move under contact.
  • Figure 4: Intersection-over-Union (higher $\uparrow$ is better) for unseen object reconstruction as a function of the interaction threshold $\tau$. Our method consistently outperforms LRM across all $\tau$ values. Direct3D is excluded from this comparison because its predicted meshes are not view-aligned.
  • Figure 5: Chamfer Distance (lower $\downarrow$ is better) for unseen object reconstruction as a function of the interaction threshold $\tau$. Our method achieves consistently lower error, and therefore more accurate surface reconstruction, than LRM and Direct3D across all thresholds.
  • ...and 3 more figures
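
The threshold sweep reported in Figure 4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: it assumes a voxel counts as occupied when its field value is at least $\tau$, that predicted and ground-truth fields are sampled on the same voxel grid and flattened to lists, and that two empty occupancy sets score 1.0. The function name `iou_at_threshold` and the sample values are hypothetical.

```python
def iou_at_threshold(pred, gt, tau):
    """Intersection-over-Union of two fields binarized at threshold tau.

    pred, gt: equal-length sequences of per-voxel field values
        (assumed to be sampled on the same grid).
    tau: interaction threshold; a voxel is occupied when value >= tau
        (an assumed convention, not confirmed by the source).
    """
    pred_occ = [p >= tau for p in pred]
    gt_occ = [g >= tau for g in gt]
    inter = sum(p and g for p, g in zip(pred_occ, gt_occ))
    union = sum(p or g for p, g in zip(pred_occ, gt_occ))
    # Convention chosen here: two empty sets are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


# Hypothetical per-voxel values, for illustration only.
pred_field = [0.1, 0.6, 0.9, 0.0]
gt_field = [0.0, 0.7, 0.4, 0.0]

# Sweep the threshold to obtain one IoU value per tau.
sweep = {tau: iou_at_threshold(pred_field, gt_field, tau) for tau in (0.05, 0.5, 0.95)}
```

Each entry of `sweep` corresponds to one point on an IoU-versus-$\tau$ curve.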