Next Best Sense: Guiding Vision and Touch with FisherRF for 3D Gaussian Splatting

Matthew Strong, Boshu Lei, Aiden Swann, Wen Jiang, Kostas Daniilidis, Monroe Kennedy

TL;DR

This work addresses the challenge of constructing accurate 3D scene representations from very few views for robotic manipulation by introducing Next Best Sense, a framework that combines SAM2-based semantic depth alignment, depth-guided FisherRF next-best-view (NBV) optimization, and uncertainty-driven tactile refinement to train 3D Gaussian Splatting (3DGS) online. The approach extends 3DGS with three core components: (1) robust few-shot initialization via SAM2 depth alignment and Pearson depth guidance, (2) depth-aware NBV selection using FisherRF with a depth information gain term, and (3) uncertainty-guided touch data that refines local geometry. Key contributions include a novel depth alignment strategy leveraging SAM2, a depth-focused FisherRF extension for view and touch selection, and a real-time robotic pipeline that improves both geometry and visual fidelity on synthetic and real objects with limited views and touches. The results show that depth uncertainty is a strong driver of informative views and touches, enabling efficient, autonomous sensing for robotic manipulation with 3DGS.

Abstract

We propose a framework for active next-best-view and touch selection for robotic manipulators using 3D Gaussian Splatting (3DGS). 3DGS is emerging as a useful explicit 3D scene representation for robotics, as it can represent scenes in a manner that is both photorealistic and geometrically accurate. However, in real-world, online robotic scenes where the number of views is limited by efficiency requirements, random view selection for 3DGS becomes impractical because views are often overlapping and redundant. We address this issue by proposing an end-to-end online training and active view selection pipeline, which enhances the performance of 3DGS in few-view robotics settings. We first improve the performance of few-shot 3DGS with a novel semantic depth alignment method using Segment Anything Model 2 (SAM2), which we supplement with Pearson depth and surface normal losses to improve color and depth reconstruction of real-world scenes. We then extend FisherRF, a next-best-view selection method for 3DGS, to select views and touch poses based on depth uncertainty. We perform online view selection on a real robot system during live 3DGS training. We motivate our improvements to few-shot GS scenes and extend depth-based FisherRF to them, demonstrating both qualitative and quantitative improvements on challenging robot scenes. For more information, please see our project page at https://arm.stanford.edu/next-best-sense.
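
The abstract mentions a Pearson depth loss without giving its form. As a rough illustration only, the sketch below uses the common formulation of such a loss: one minus the Pearson correlation between a rendered depth map and a reference depth map, which ignores any global scale and shift between the two. The function name `pearson_depth_loss`, the optional `mask` argument, and the use of NumPy instead of a differentiable framework are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def pearson_depth_loss(rendered, reference, mask=None, eps=1e-8):
    """Return 1 - Pearson correlation between two depth maps.

    Hypothetical helper for illustration. The loss is near 0 when the
    rendered depth is a positive affine function of the reference depth,
    and near 2 when the two are perfectly anti-correlated.
    """
    r = np.asarray(rendered, dtype=np.float64)
    t = np.asarray(reference, dtype=np.float64)
    if mask is None:
        mask = np.ones(r.shape, dtype=bool)
    # Keep only the pixels selected by the mask, then center both signals.
    r = r[mask] - r[mask].mean()
    t = t[mask] - t[mask].mean()
    # Normalized cross-correlation; eps guards against constant depth maps.
    corr = (r * t).sum() / (np.sqrt((r * r).sum() * (t * t).sum()) + eps)
    return 1.0 - corr
```

Because the correlation is invariant to scale and shift, a loss of this kind can compare rendered depth against relative (monocular) depth that has no metric scale, which is one reason it is often paired with few-view 3DGS training.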

Paper Structure (19 sections, 7 equations, 7 figures, 6 tables)

This paper contains 19 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Our method outperforms random view selection for few-view scenes. We show a series of robot views, with the next view proposed by our method compared to a random one.
  • Figure 2: Mesh of Incorporating Lifted Depths (left) and Lifted SAM2 Depths (right). A semantic alignment provides a robust initialization for 3DGS.
  • Figure 3: SAM2 Alignment. Given an RGB image and depth image, we provide the RGB as input to a monocular depth model to get relative depths, and run the SAM2 automatic mask generator to get object and scene masks. We then align each object in the monocular depth with the corresponding sensor depth.
  • Figure 4: Tactile Data Supervision. We backproject points from fisheye coordinates to form a triangle face $P_u, P_v, P_m$, which is projected onto the image plane and rasterized over the corresponding area $u',v',m'$ for touch depth supervision.
  • Figure 5: FisherRF Ablations. Qualitative results of Random (Left), FisherRF Color (Middle), and FisherRF Depth (Right).
  • ...and 2 more figures
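
The per-object alignment described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions: each mask is taken to be a boolean array (standing in for the output of the SAM2 automatic mask generator), and the alignment is assumed to be a least-squares scale and shift fit of the monocular depth to the valid sensor depth inside each mask. The function name `align_depth_per_mask` and the `min_valid` threshold are hypothetical, and the paper's actual alignment procedure may differ.

```python
import numpy as np

def align_depth_per_mask(mono, sensor, masks, min_valid=10):
    """Align relative monocular depth to sensor depth, one mask at a time.

    Hypothetical helper for illustration. For each boolean mask, fit
    scale and shift so that scale * mono + shift matches the sensor depth
    on pixels with a valid reading, then apply the fit to the whole mask.
    Pixels not covered by any fitted mask are left as NaN.
    """
    mono = np.asarray(mono, dtype=np.float64)
    sensor = np.asarray(sensor, dtype=np.float64)
    aligned = np.full(mono.shape, np.nan)
    # Treat non-finite or non-positive sensor readings as missing.
    valid = np.isfinite(sensor) & (sensor > 0)
    for mask in masks:
        fit = mask & valid
        if fit.sum() < min_valid:
            continue  # too few sensor readings to fit this object
        # Solve min over (scale, shift) of ||scale * mono + shift - sensor||^2.
        A = np.stack([mono[fit], np.ones(fit.sum())], axis=1)
        (scale, shift), *_ = np.linalg.lstsq(A, sensor[fit], rcond=None)
        aligned[mask] = scale * mono[mask] + shift
    return aligned
```

Fitting each object separately lets the aligned depth fill holes in the sensor depth with metrically plausible values, since the fit from the valid pixels of a mask is applied to every pixel in that mask.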