FusionSense: Bridging Common Sense, Vision, and Touch for Robust Sparse-View Reconstruction

Irving Fang, Kairui Shi, Xujin He, Siqi Tan, Yifan Wang, Hanwen Zhao, Hung-Jui Huang, Wenzhen Yuan, Chen Feng, Jing Zhang

TL;DR

FusionSense is a novel 3D reconstruction framework that enables robots to fuse priors from foundation models with highly sparse observations from vision and tactile sensors. It outperforms previous state-of-the-art sparse-view methods.

Abstract

Humans effortlessly integrate common-sense knowledge with sensory input from vision and touch to understand their surroundings. Emulating this capability, we introduce FusionSense, a novel 3D reconstruction framework that enables robots to fuse priors from foundation models with highly sparse observations from vision and tactile sensors. FusionSense addresses three key challenges: (i) How can robots efficiently acquire robust global shape information about the surrounding scene and objects? (ii) How can robots strategically select touch points on the object using geometric and common-sense priors? (iii) How can partial observations such as tactile signals improve the overall representation of the object? Our framework employs 3D Gaussian Splatting as its core representation and incorporates a hierarchical optimization strategy involving global structure construction, object visual hull pruning, and local geometric constraints. The result is fast and robust perception in environments with traditionally challenging objects that are transparent, reflective, or dark, which supports more downstream manipulation and navigation tasks. Experiments on real-world data suggest that our framework outperforms previous state-of-the-art sparse-view methods. All code and data are open-sourced on the project website.
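
To make the "object visual hull pruning" step more concrete, the following is a minimal sketch of generic silhouette carving, not the FusionSense implementation. It assumes each view provides a 3x4 projection matrix and a binary object mask, and it keeps only the 3D points (for example, Gaussian centers) whose projections land inside the mask in every view. The function name `visual_hull_mask`, the toy cameras, and the rule that points falling outside an image are carved are all illustrative assumptions.

```python
import numpy as np


def visual_hull_mask(points, projections, masks):
    """Return a boolean array that is True for points inside every silhouette.

    points:      (N, 3) array of 3D positions (hypothetical Gaussian centers).
    projections: list of 3x4 camera projection matrices, one per view.
    masks:       list of boolean (H, W) object masks, one per view.
    Assumption: a point behind a camera or outside its image is carved away.
    """
    points = np.asarray(points, dtype=float)
    homog = np.hstack([points, np.ones((len(points), 1))])  # (N, 4)
    keep = np.ones(len(points), dtype=bool)
    for P, mask in zip(projections, masks):
        h, w = mask.shape
        uvw = homog @ np.asarray(P, dtype=float).T  # (N, 3) homogeneous pixels
        depth = uvw[:, 2]
        in_front = depth > 1e-9
        safe_depth = np.where(in_front, depth, 1.0)  # avoid dividing by <= 0
        u = np.floor(uvw[:, 0] / safe_depth).astype(int)
        v = np.floor(uvw[:, 1] / safe_depth).astype(int)
        in_image = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        inside = np.zeros(len(points), dtype=bool)
        inside[in_image] = mask[v[in_image], u[in_image]]
        keep &= inside  # a point survives only if every view agrees
    return keep


# Toy setup: two pinhole cameras that differ by a small sideways shift.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[0.1], [0.0], [0.0]])])

# The object silhouette is a 20x20 pixel square in both views.
silhouette = np.zeros((100, 100), dtype=bool)
silhouette[40:60, 40:60] = True

projections = [P1, P2]
masks = [silhouette, silhouette]

points = np.array([
    [0.0, 0.0, 2.0],    # inside both silhouettes
    [-0.15, 0.0, 2.0],  # inside both silhouettes
    [0.15, 0.0, 2.0],   # inside view 1, outside view 2
    [0.5, 0.0, 2.0],    # outside view 1
    [0.0, 0.0, -1.0],   # behind both cameras
])
keep = visual_hull_mask(points, projections, masks)
pruned_points = points[keep]
```

In a Gaussian Splatting setting, a mask like `keep` could be used to discard Gaussians that lie outside the object's visual hull before further optimization. How FusionSense obtains its masks and which Gaussians it prunes is described in the paper itself, not in this sketch.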

Paper Structure

This paper contains 30 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of FusionSense. Inspired by human perception, FusionSense integrates common sense from foundation models with sparse-view data from both vision and touch through 3D Gaussian Splatting, enabling efficient and robust 3D reconstruction of a robot's surroundings. Our proposed system features three core modules: (i) robust global shape representation, (ii) active touch point selection on the object, and (iii) local geometric optimization.
  • Figure 2: We use the visual hull and estimated depth to initialize our Gaussians, and use RGB, depth, and estimated normals to supervise training.
  • Figure 3: (1) Point Cloud Extracted from $\boldsymbol{O}'$. (2) Part Segmentation from PartSLIP. (3) High Gradient Gaussians. (4) 10 Selected Touch Points $\boldsymbol{t}_i$.
  • Figure 4: Qualitative comparisons of novel view synthesis, depth estimation, and normal estimation under sparse observations. The results come from scenes with two challenging objects: a black bunny and a transparent Coca-Cola cup. We compare (i) the reference (ground-truth RGB images, depth images from a RealSense camera, and normal estimates generated by the DSINE monocular normal foundation model [Dsine]), (ii) the proposed FusionSense framework, and (iii) the DN-Splatter approach [turkulainen2024dn]. Using sparse observations (9 views and 10 tactile contacts), FusionSense achieves higher image fidelity and more precise depth and normal estimates than DN-Splatter, which relies on 9 views alone.
  • Figure 5: Rendering Results Using 5 Views.