Beyond Visual Field of View: Perceiving 3D Environment with Echoes and Vision

Lingyu Zhu; Esa Rahtu; Hang Zhao

TL;DR

The paper tackles the problem of perceiving and navigating 3D environments when visual input is limited to a narrow field of view (FoV), using binaural echoes to cover what the camera cannot see. It introduces an end-to-end architecture that fuses echoes received from multiple orientations with an RGB image to predict wide-FoV depth maps, and it shows how this extended depth improves embodied navigation, including a novel PointGoal echo navigation task. Key contributions include a design with four echo encoders, a vision-echo fusion pipeline for wide-FoV depth, and empirical evidence that echoes outperform RGB alone for navigation and further improve performance when fused with vision. The experiments use SoundSpaces in Habitat on the Replica and Matterport3D environments and show that echolocation provides holistic geometric cues, supporting navigation in large or unseen regions without additional cameras or sensors.

Abstract

This paper focuses on perceiving and navigating 3D environments using echoes and RGB images. In particular, we perform depth estimation by fusing an RGB image with echoes received from multiple orientations. Unlike previous works, we go beyond the field of view of the RGB image and estimate dense depth maps for substantially larger parts of the environment. We show that the echoes provide holistic and inexpensive information about the 3D structures, complementing the RGB image. Moreover, we study how echoes and the wide field-of-view depth maps can be utilised in robot navigation. We compare the proposed methods against recent baselines using two sets of challenging realistic 3D environments: Replica and Matterport3D. The implementation and pre-trained models will be made publicly available.

Paper Structure (32 sections, 2 equations, 7 figures, 4 tables)

Introduction
Related Work
Audio-Visual Learning
Spatial Reasoning with Echoes
Monocular Depth Estimation
Learning to Navigate in 3D Environments
Predicting Wide Field of View Depth Maps from Echoes and RGB
Overview
Depth Estimation From Echoes
Depth Estimation from Echoes and RGB
Estimating depth maps beyond the visual field of view
Extending depth prediction to complete unseen areas
Navigating Using Echoes and RGB
Overview
PointGoal Echo Navigation Task Setup
...and 17 more sections

Figures (7)

Figure 1: Leveraging echoes to extend depth prediction over RGB FoV.
Figure 2: The framework of depth estimation from echoes and RGB image. It consists of four echo encoders, a vision encoder, and a depth decoder. The four echo responses represent the echoes received from four different orientations (e.g., front, right, back, left side) relative to the green "arrow" (target depth orientation). The shaded gray triangle represents the field of view for the RGB input or predicted depth.
Figure 3: The architecture of the PointGoal echo navigation.
Figure 4: Visualization of depth prediction in comparison to baselines. The depth maps in the second and third columns are estimated using only echoes.
Figure 5: Depth prediction using RGB FoV $\in (0, 120]$, w/o echoes (green) and w/ echoes (red). For the thresholded accuracy $\delta$ (last column), the curves marked with $\bullet$, $\ast$, and $\times$ denote $\delta_{1.25}$, $\delta_{1.25^{2}}$, and $\delta_{1.25^{3}}$, respectively.
...and 2 more figures
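
The thresholded accuracy $\delta$ in the Figure 5 caption is a standard metric in monocular depth estimation: the fraction of pixels whose ratio $\max(\hat{d}/d,\ d/\hat{d})$ between predicted depth $\hat{d}$ and ground-truth depth $d$ falls below $1.25^{k}$. The following is a minimal sketch of that common definition, not the authors' evaluation code. The function name `threshold_accuracy` and the rule of ignoring non-positive depths are assumptions made for illustration; the paper's own masking and depth range are not stated here.

```python
import numpy as np


def threshold_accuracy(pred, gt, base=1.25, powers=(1, 2, 3)):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < base**k.

    Hypothetical helper: pixels with non-positive depth in either map
    are treated as invalid and skipped (an assumption, not the paper's rule).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")

    valid = (gt > 0) & (pred > 0)
    if not valid.any():
        raise ValueError("no valid pixels to evaluate")

    # Symmetric ratio: always >= 1, equal to 1 for a perfect prediction.
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return {k: float(np.mean(ratio < base ** k)) for k in powers}


if __name__ == "__main__":
    predicted = np.array([1.0, 2.0, 4.0, 1.0])
    ground_truth = np.array([1.0, 1.5, 2.0, 0.0])  # last pixel is invalid
    print(threshold_accuracy(predicted, ground_truth))
```

Higher values are better for all three thresholds, and because the thresholds are nested, $\delta_{1.25} \le \delta_{1.25^{2}} \le \delta_{1.25^{3}}$ always holds.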
