VAIR: Visuo-Acoustic Implicit Representations for Low-Cost, Multi-Modal Transparent Surface Reconstruction in Indoor Scenes

Advaith V. Sethuraman, Onur Bagoren, Harikrishnan Seetharaman, Dalton Richardson, Joseph Taylor, Katherine A. Skinner

TL;DR

The paper proposes a model that uses generative latent optimization to learn an implicit representation of indoor scenes containing transparent surfaces. The representation can be queried for volumetric rendering in image space or for 3D geometry reconstruction with transparent surface prediction.

Abstract

Mobile robots operating indoors must be prepared to navigate challenging scenes that contain transparent surfaces. This paper proposes a novel method for the fusion of acoustic and visual sensing modalities through implicit neural representations to enable dense reconstruction of transparent surfaces in indoor scenes. We propose a novel model that leverages generative latent optimization to learn an implicit representation of indoor scenes consisting of transparent surfaces. We demonstrate that we can query the implicit representation to enable volumetric rendering in image space or 3D geometry reconstruction (point clouds or mesh) with transparent surface prediction. We evaluate our method's effectiveness qualitatively and quantitatively on a new dataset collected using a custom, low-cost sensing platform featuring RGB-D cameras and ultrasonic sensors. Our method exhibits significant improvement over state-of-the-art for transparent surface reconstruction.
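
To make "querying the implicit representation for volumetric rendering" concrete, the sketch below renders an expected depth along one camera ray from density samples. It uses the standard NeRF-style quadrature (opacity per interval, accumulated transmittance), which is an assumption here: this excerpt does not state the exact rendering equation VAIR uses. The function name `render_depth` and the sample values are hypothetical.

```python
import math

def render_depth(sigmas, ts):
    """Expected depth along a ray from densities `sigmas` sampled at distances `ts`.

    Assumes at least two samples, sorted by increasing distance.
    Returns (expected depth, accumulated opacity).
    """
    transmittance = 1.0   # fraction of light that has not been absorbed yet
    depth = 0.0
    opacity = 0.0
    for i, sigma in enumerate(sigmas):
        # Interval length; the last sample reuses the previous spacing.
        delta = ts[i + 1] - ts[i] if i + 1 < len(ts) else ts[i] - ts[i - 1]
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this interval
        weight = transmittance * alpha           # contribution of this sample
        depth += weight * ts[i]
        opacity += weight
        transmittance *= 1.0 - alpha
    return depth, opacity

# Hypothetical ray: empty space, then a dense surface 1.5 m from the camera.
ts = [0.5, 1.0, 1.5, 2.0]
sigmas = [0.0, 0.0, 50.0, 0.0]
depth, opacity = render_depth(sigmas, ts)
```

In a setup like the one the abstract describes, the densities would come from the learned implicit field rather than a hand-written list, so a transparent surface predicted by the model would appear as a depth return along the ray.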

Paper Structure

This paper contains 15 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: We augment the mapping and reconstruction capabilities of mobile robots with low-cost acoustic sensors that can measure sparse returns from glass surfaces. Our framework, VAIR, fully reconstructs glass surfaces, producing useful 3D geometry for robotic systems. VAIR's transparent surface prediction is shown in green.
  • Figure 2: VAIR is designed to integrate into existing robotic SLAM pipelines. The fusion of sparse acoustic sensor measurements and RGB-D imagery allows us to sense and reconstruct transparent surfaces. VAIR exploits semantic information from RGB images to further inform prediction and reconstruction of transparent surfaces and to learn an implicit representation of the scene. This representation can be queried for downstream robotic tasks.
  • Figure 3: Our Acoustic-Semantic Planar Projection (ASPP). We project rays through pixels predicted as transparent surface pixels, as specified by semantic segmentation, to further inform the extents of transparent surfaces in the scene. The projection of acoustic points onto the semantic rays is provided as input to VAIR.
  • Figure 4: Overview of VAIR. During training, VAIR takes as input latent codes for the scene and transparent surfaces. The respective decoders map the latent codes to density values $\hat{\sigma}^s_j$ and $\hat{\sigma}^t_j$ supervised by the scene point cloud $X^s_i$ and transparent surface point cloud $X^t_i$. During test time, we perform a generative latent optimization with randomly initialized latent codes. The latent codes are passed to decoders and losses are computed on density values $\hat{\sigma}^s_j$ and $\hat{\sigma}^t_j$ against scene geometry, semantic information, and sparse acoustic measurements in the form of the ASPP. VAIR is able to predict an implicit density field that completes the scene after finding latent codes that best explain the partial scene geometry.
  • Figure 5: Our custom sensing platform consists of an array of low-cost acoustic sensors, an Arduino, and a forward-facing Intel RealSense D435i RGB-D camera.
  • ...and 1 more figure
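
The test-time procedure described in the Figure 4 caption (freeze the decoder, start from a random latent code, and optimize the code until the decoded densities explain the partial observations) can be illustrated with a toy example. This is a minimal sketch under strong assumptions, not the authors' implementation: the decoder is a fixed linear map instead of a trained network, there is a single latent code instead of separate scene and transparent-surface codes, and the loss is a plain mean squared error instead of the geometry, semantic, and acoustic terms the caption mentions. All names (`DECODER_W`, `decode`, `optimize_latent`) are hypothetical.

```python
import random

# Hypothetical frozen decoder: row j maps a 2-D latent code to the density
# at query point j. A trained neural decoder would take its place.
DECODER_W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [0.5, 0.5]]

def decode(z):
    """Density at every query point for latent code z."""
    return [w[0] * z[0] + w[1] * z[1] for w in DECODER_W]

def optimize_latent(observed, steps=200, lr=0.5, seed=0):
    """Fit a latent code to partial observations by gradient descent.

    observed: dict {query index: measured density}.
    The decoder stays fixed; only the latent code is updated.
    The step size is chosen for this toy decoder, not a general default.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]  # random initialization
    n = len(observed)
    for _ in range(steps):
        pred = decode(z)
        grad = [0.0, 0.0]
        for j, target in observed.items():
            residual = pred[j] - target
            # Gradient of the mean squared error with respect to z.
            grad[0] += 2.0 * residual * DECODER_W[j][0] / n
            grad[1] += 2.0 * residual * DECODER_W[j][1] / n
        z = [z[0] - lr * grad[0], z[1] - lr * grad[1]]
    return z

# Simulate a scene, observe only three of the five query points,
# then recover the latent code and decode the full scene.
z_true = [0.8, 0.3]
full_scene = decode(z_true)
partial = {j: full_scene[j] for j in (0, 1, 2)}
z_hat = optimize_latent(partial)
completed = decode(z_hat)
```

The decoded values at the unobserved query points (indices 3 and 4) are the "completion": they are never supervised directly, but follow from the latent code that best explains the observed part of the scene.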