Next Best Sense: Guiding Vision and Touch with FisherRF for 3D Gaussian Splatting
Matthew Strong, Boshu Lei, Aiden Swann, Wen Jiang, Kostas Daniilidis, Monroe Kennedy
TL;DR
This work addresses the challenge of building accurate 3D scene representations from very few views for robotic manipulation. It introduces Next Best Sense, a framework that trains 3D Gaussian Splatting (3DGS) online by combining SAM2-based semantic depth alignment, depth-guided next-best-view (NBV) selection built on FisherRF, and uncertainty-driven tactile refinement. The approach extends 3DGS with three components: (1) robust few-shot initialization via SAM2 depth alignment and Pearson depth guidance, (2) depth-aware NBV selection using FisherRF with a depth information gain term, and (3) uncertainty-guided touch data that refines local geometry. Key contributions are a novel depth alignment strategy leveraging SAM2, a depth-focused extension of FisherRF for view and touch selection, and a real-time robotic pipeline that improves both geometry and visual fidelity on synthetic and real objects with limited views and touches. The results indicate that depth uncertainty is a strong driver of informative views and touches, enabling efficient, autonomous sensing for robotic manipulation with 3DGS.
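To make the view-selection idea concrete, the following is a minimal sketch of Fisher-information-based candidate scoring in the spirit of FisherRF, using a diagonal approximation of the Hessian. It is not the authors' implementation. The function names (`diag_fisher`, `select_next_view`), the regularizer `reg`, and the tiny hand-written Jacobians are all hypothetical; in a real pipeline the Jacobians would presumably come from differentiating the rendered depth with respect to the Gaussian parameters.

```python
import numpy as np

def diag_fisher(jacobian):
    """Diagonal of J^T J for a (num_pixels, num_params) Jacobian (assumed layout)."""
    J = np.asarray(jacobian, dtype=np.float64)
    return np.sum(J * J, axis=0)

def select_next_view(train_jacobians, candidate_jacobians, reg=1e-6):
    """Score each candidate by tr(H_candidate @ inv(H_train + reg * I)),
    with both Hessians approximated by their diagonals."""
    # Information already accumulated from the training views.
    h_train = sum(diag_fisher(J) for J in train_jacobians) + reg
    # A candidate scores high when it constrains parameters that the
    # training views have left poorly constrained.
    scores = [float(np.sum(diag_fisher(J) / h_train)) for J in candidate_jacobians]
    return int(np.argmax(scores)), scores

# Toy example with two parameters: the training view only observes parameter 0.
train = [np.array([[10.0, 0.0]])]
candidates = [
    np.array([[1.0, 0.0]]),  # redundant: looks at parameter 0 again
    np.array([[0.0, 1.0]]),  # informative: looks at the unobserved parameter 1
]
best, scores = select_next_view(train, candidates)
```

In this toy setup the second candidate is selected, since it is the only one that observes the unconstrained parameter. The same scoring could in principle rank touch poses if a Jacobian of the expected tactile depth were available, though that is an assumption of this sketch.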
Abstract
We propose a framework for active next best view and touch selection for robotic manipulators using 3D Gaussian Splatting (3DGS). 3DGS is emerging as a useful explicit 3D scene representation for robotics, as it can represent scenes in a manner that is both photorealistic and geometrically accurate. However, in real-world, online robotic scenes where efficiency requirements limit the number of views, random view selection for 3DGS becomes impractical because views are often overlapping and redundant. We address this issue by proposing an end-to-end online training and active view selection pipeline that enhances the performance of 3DGS in few-view robotics settings. We first improve few-shot 3DGS with a novel semantic depth alignment method using Segment Anything Model 2 (SAM2), which we supplement with Pearson depth and surface normal losses to improve color and depth reconstruction of real-world scenes. We then extend FisherRF, a next-best-view selection method for 3DGS, to select views and touch poses based on depth uncertainty. We perform online view selection on a real robot system during live 3DGS training. We motivate our improvements to few-shot 3DGS and apply depth-based FisherRF to the resulting scenes, demonstrating both qualitative and quantitative improvements on challenging robot scenes. For more information, please see our project page at https://arm.stanford.edu/next-best-sense.
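As an illustration of the Pearson depth term mentioned above, here is a minimal sketch of a Pearson-correlation depth loss, written as one minus the correlation between a rendered depth map and a reference depth map. The exact formulation used in this work may differ; the function name `pearson_depth_loss`, the optional `mask` argument, and the `eps` stabilizer are assumptions made for this example.

```python
import numpy as np

def pearson_depth_loss(rendered_depth, reference_depth, mask=None, eps=1e-8):
    """Return 1 - Pearson correlation between two depth maps.

    The loss is 0 when the maps agree up to a positive scale and a shift,
    which suits reference depth with unknown scale (e.g. monocular estimates).
    """
    r = np.asarray(rendered_depth, dtype=np.float64).ravel()
    m = np.asarray(reference_depth, dtype=np.float64).ravel()
    if mask is not None:
        # Optionally restrict the comparison to valid pixels.
        keep = np.asarray(mask, dtype=bool).ravel()
        r, m = r[keep], m[keep]
    # Center both signals, then take the normalized inner product.
    r = r - r.mean()
    m = m - m.mean()
    corr = (r @ m) / (np.linalg.norm(r) * np.linalg.norm(m) + eps)
    return 1.0 - corr

rng = np.random.default_rng(0)
depth = rng.uniform(0.5, 2.0, size=(8, 8))
loss_affine = pearson_depth_loss(depth, 3.0 * depth + 2.0)  # close to 0
loss_flipped = pearson_depth_loss(depth, -depth)            # close to 2
```

Because the correlation ignores positive scale and shift, such a term only constrains relative depth structure. That is one reason it is commonly paired with a separate alignment step when metric depth is needed.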
