Touch-GS: Visual-Tactile Supervised 3D Gaussian Splatting
Aiden Swann, Matthew Strong, Won Kyung Do, Gadiel Sznaier Camps, Mac Schwager, Monroe Kennedy
TL;DR
Touch-GS tackles few-shot 3D scene synthesis by fusing tactile data with monocular depth priors to supervise 3D Gaussian Splatting (3DGS). It uses a Gaussian Process Implicit Surface (GPIS) to represent tactile information with uncertainty, conditions it on DenseTact touch data, and renders depth and uncertainty images that are fused with a monocular depth map via a per-pixel Bayesian update. An uncertainty-weighted depth supervision loss and GPIS-based initialization guide training, yielding improvements over vision-only and touch-only baselines, including on challenging materials such as mirrors and transparent objects. Real-world and simulated experiments validate the approach, showing improved geometry and background fidelity and highlighting practical benefits for robotic manipulation with limited views. The framework is modular and extensible to other radiance-field representations such as NeRF, offering a pathway to more robust, multimodal scene understanding in robotics.
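The per-pixel Bayesian update mentioned above can be illustrated with a minimal sketch. It assumes each pixel's depth from vision and from touch is an independent Gaussian estimate, so the fused estimate is the normalized product of the two. The function name `fuse_depth` and the toy values are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def fuse_depth(mu_vision, var_vision, mu_touch, var_touch):
    """Fuse two per-pixel Gaussian depth estimates (assumed independent).

    Returns the mean and variance of the product of the two Gaussians,
    computed elementwise over the image.
    """
    mu_vision = np.asarray(mu_vision, dtype=float)
    var_vision = np.asarray(var_vision, dtype=float)
    mu_touch = np.asarray(mu_touch, dtype=float)
    var_touch = np.asarray(var_touch, dtype=float)

    total = var_vision + var_touch
    # Each mean is weighted by the *other* source's variance,
    # so the more certain source dominates.
    fused_mu = (mu_vision * var_touch + mu_touch * var_vision) / total
    # The fused variance is never larger than either input variance.
    fused_var = var_vision * var_touch / total
    return fused_mu, fused_var

# Toy 1x2 "images" (hypothetical values, in meters and meters squared).
mu_v = np.array([[1.0, 2.0]])
var_v = np.array([[0.04, 0.04]])
mu_t = np.array([[1.2, 2.0]])
var_t = np.array([[0.04, 1e6]])  # second pixel: touch is nearly uninformative

fused_mu, fused_var = fuse_depth(mu_v, var_v, mu_t, var_t)
```

Where both sources are equally confident (first pixel), the fused depth lies midway between them and the variance halves. Where touch is uninformative (second pixel), the result falls back to the vision estimate.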
Abstract
In this work, we propose a novel method to supervise 3D Gaussian Splatting (3DGS) scenes using optical tactile sensors. Optical tactile sensors are now widely used in robotics for manipulation and object representation; however, raw optical tactile sensor data is unsuitable for directly supervising a 3DGS scene. Our representation leverages a Gaussian Process Implicit Surface (GPIS) to implicitly represent the object, combining many touches into a unified representation with uncertainty. We merge this model with the output of a monocular depth estimation network, which is aligned in a two-stage process: coarsely with a depth camera, and then finely to match our touch data. For every training image, our method produces a corresponding fused depth and uncertainty map. Utilizing this additional information, we propose a new loss function, a variance-weighted depth-supervised loss, for training the 3DGS scene model. We leverage the DenseTact optical tactile sensor and a RealSense RGB-D camera to show that combining touch and vision in this manner leads to quantitatively and qualitatively better results than vision or touch alone in few-view scene synthesis on opaque as well as reflective and transparent objects. Please see our project page at http://armlabstanford.github.io/touch-gs
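As a rough illustration of what a variance-weighted depth-supervised loss could look like, the sketch below down-weights each pixel's squared depth error by the fused variance at that pixel. This inverse-variance form is an assumption chosen for illustration, not necessarily the loss defined in the paper. The function name `variance_weighted_depth_loss` and the `eps` parameter are hypothetical.

```python
import numpy as np

def variance_weighted_depth_loss(rendered_depth, fused_depth, fused_var, eps=1e-6):
    """Mean squared depth error, weighted per pixel by inverse variance.

    Assumed form: pixels with low uncertainty in the fused depth map pull
    the rendered depth strongly; high-uncertainty pixels contribute little.
    """
    rendered_depth = np.asarray(rendered_depth, dtype=float)
    fused_depth = np.asarray(fused_depth, dtype=float)
    fused_var = np.asarray(fused_var, dtype=float)

    residual = rendered_depth - fused_depth
    weights = 1.0 / (fused_var + eps)  # eps guards against division by zero
    return float(np.mean(weights * residual ** 2))

# Toy 1x2 example (hypothetical values).
rendered = np.array([[1.0, 2.0]])
target = np.array([[1.1, 2.5]])
variance = np.array([[0.01, 1.0]])  # first pixel is far more certain

loss = variance_weighted_depth_loss(rendered, target, variance)
```

In this toy case the first pixel has a 0.1 error but low variance, and it contributes more to the loss than the second pixel, whose error is 0.5 but whose variance is high.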
