Touch-GS: Visual-Tactile Supervised 3D Gaussian Splatting

Aiden Swann, Matthew Strong, Won Kyung Do, Gadiel Sznaier Camps, Mac Schwager, Monroe Kennedy

TL;DR

Touch-GS tackles few-shot 3D scene synthesis by fusing tactile data with monocular depth priors to supervise 3D Gaussian Splatting (3DGS). It uses a Gaussian Process Implicit Surface (GPIS) to represent tactile information with uncertainty, conditions it on DenseTact touch data, and renders depth and uncertainty images that are fused with a monocular depth map via a per-pixel Bayesian update. An uncertainty-weighted depth supervision loss and GPIS-based initialization guide training, improving on vision-only and touch-only baselines, including on challenging materials such as mirrors and transparent objects. Real-world and simulated experiments show improved geometry and background fidelity, which matters for robotic manipulation with limited views. The framework is modular and extensible to NeRF-based representations, offering a path to more robust multimodal scene understanding in robotics.
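The per-pixel Bayesian update mentioned above can be illustrated with a minimal sketch. It assumes that the touch-derived depth and the monocular depth at each pixel are independent Gaussian estimates, in which case the fused estimate is the standard product of two Gaussians. The summary does not give the exact formula used by Touch-GS, and the function names `fuse_pixel` and `fuse_maps` are hypothetical.

```python
def fuse_pixel(d_touch, var_touch, d_vision, var_vision):
    """Fuse two independent Gaussian depth estimates for a single pixel.

    Assumption: standard product-of-Gaussians update, not necessarily
    the exact rule used in Touch-GS.
    """
    total = var_touch + var_vision
    # Each depth is weighted by the *other* source's variance, so the
    # more certain source dominates the fused value.
    depth = (var_vision * d_touch + var_touch * d_vision) / total
    var = (var_touch * var_vision) / total
    return depth, var


def fuse_maps(touch_depth, touch_var, vision_depth, vision_var):
    """Apply fuse_pixel over same-shaped 2D lists (rows of pixels)."""
    fused_depth, fused_var = [], []
    rows = zip(touch_depth, touch_var, vision_depth, vision_var)
    for dt_row, vt_row, dv_row, vv_row in rows:
        pairs = [fuse_pixel(*p) for p in zip(dt_row, vt_row, dv_row, vv_row)]
        fused_depth.append([d for d, _ in pairs])
        fused_var.append([v for _, v in pairs])
    return fused_depth, fused_var
```

Under this assumption the fused variance is never larger than either input variance, and a pixel with a confident touch reading stays close to the touch depth regardless of the monocular estimate.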

Abstract

In this work, we propose a novel method to supervise 3D Gaussian Splatting (3DGS) scenes using optical tactile sensors. Optical tactile sensors have become widespread in robotics for manipulation and object representation; however, raw optical tactile sensor data is unsuitable for directly supervising a 3DGS scene. Our representation leverages a Gaussian Process Implicit Surface (GPIS) to implicitly represent the object, combining many touches into a unified representation with uncertainty. We merge this model with a monocular depth estimation network, which is aligned in a two-stage process: coarsely aligning with a depth camera, then finely adjusting to match our touch data. For every training image, our method produces a corresponding fused depth and uncertainty map. Using this additional information, we propose a new loss function, a variance-weighted depth-supervised loss, for training the 3DGS scene model. We leverage the DenseTact optical tactile sensor and a RealSense RGB-D camera to show that combining touch and vision in this manner leads to quantitatively and qualitatively better results than vision or touch alone in few-view scene synthesis on opaque as well as reflective and transparent objects. Please see our project page at http://armlabstanford.github.io/touch-gs
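To make the idea of a variance-weighted depth loss concrete, the sketch below shows one plausible form: a squared depth error divided by the per-pixel variance of the fused depth map, averaged over pixels. This is an assumption for illustration only. The abstract names the loss but does not state its formula, and the function name and the `eps` parameter are hypothetical.

```python
def variance_weighted_depth_loss(rendered, target, variance, eps=1e-6):
    """Mean inverse-variance-weighted squared depth error.

    rendered, target, variance: same-shaped 2D lists (rows of pixels).
    Assumption: inverse-variance weighting of an L2 error; the exact
    loss used in Touch-GS may differ.
    """
    total = 0.0
    count = 0
    for r_row, t_row, v_row in zip(rendered, target, variance):
        for r, t, v in zip(r_row, t_row, v_row):
            # Low-variance (confident) pixels contribute more strongly;
            # eps guards against division by zero.
            total += (r - t) ** 2 / (v + eps)
            count += 1
    return total / count
```

With this weighting, pixels whose fused depth is uncertain pull only weakly on the rendered depth, while confidently measured pixels, such as those near touched surfaces, constrain it tightly.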

Paper Structure

This paper contains 23 sections, 13 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Touch-GS combines monocular depth estimation priors with tactile data-informed implicit surfaces to generate high-quality 3DGS scenes from few training images. Adding touch data significantly enhances 3DGS quality (right) compared with RGB-D alone (left).
  • Figure 2: The GPIS is created by finding the 0-level-set of our GP-based SDF. We utilize the uncertainty at the 0-level-set to enhance the accuracy of our model. A z-axis slice of both SDF and uncertainty is shown above the bunnies.
  • Figure 3: Overview of our method, Touch-GS: (1) We utilize a monocular depth estimation algorithm, which is metrically aligned in a two-phase process with RealSense depth and the GPIS output. (2) We condition a GP on the point cloud generated by DenseTact and render it into a series of depth and uncertainty images. (3) Monocular depth and tactile information are fused to produce a single set of training images that combine touch and vision.
  • Figure 4: We show our optimized SDF rendering process. $\alpha = 0.5$, thus each step halves the distance to the surface.
  • Figure 5: (a) shows the input point clouds and (b) shows the rendered 0-level-set colored by uncertainty. The method is able to fill in the gaps in the point cloud while showing more uncertainty in the interpolated areas. The last column is a real-world dataset.
  • ...and 2 more figures
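
The SDF rendering step described in the Figure 4 caption can be sketched as relaxed sphere tracing: at each iteration the ray advances by $\alpha$ times the current signed distance, so with $\alpha = 0.5$ and a ray heading straight at the surface, the remaining distance halves each step. The sketch below is illustrative only. It substitutes an analytic sphere SDF for the GP-based SDF, and the names `sphere_sdf` and `trace_ray`, the tolerance, and the step limit are all assumptions.

```python
import math


def sphere_sdf(p, center=(0.0, 0.0, 3.0), radius=1.0):
    """Analytic signed distance to a sphere (stand-in for the GP-based SDF)."""
    return math.dist(p, center) - radius


def trace_ray(origin, direction, sdf, alpha=0.5, tol=1e-4, max_steps=100):
    """March along a unit-length ray, stepping alpha * sdf(p) each iteration.

    Returns the depth along the ray at which the surface is reached,
    or None if no hit occurs within max_steps.
    """
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < tol:
            return t
        # A relaxed step (alpha < 1) is more conservative than full
        # sphere tracing and tolerates an imperfect distance estimate.
        t += alpha * dist
    return None


depth = trace_ray((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), sphere_sdf)
```

Running one such ray per pixel would produce a depth image. With a GP-based SDF, the predictive variance could presumably be queried at each hit point to form the matching uncertainty image.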