Beyond Visual Field of View: Perceiving 3D Environment with Echoes and Vision
Lingyu Zhu, Esa Rahtu, Hang Zhao
TL;DR
The paper tackles perceiving and navigating 3D environments when visual data are limited to a narrow field of view, using binaural echoes to cover what the camera cannot see. It introduces an end-to-end architecture that fuses echoes from multiple orientations with RGB to predict wide field-of-view (FoV) depth maps, and demonstrates how this extended depth improves embodied navigation, including a novel PointGoal echo navigation task. Key contributions include four echo encoders, a vision-echo fusion pipeline for wide-FoV depth, and empirical evidence that echoes outperform RGB alone for navigation and enhance performance when fused with vision. The work uses SoundSpaces in Habitat across Replica and Matterport3D and shows that echolocation can provide holistic geometric cues, enabling robust navigation in large or unseen regions without additional cameras or sensors.
Abstract
This paper focuses on perceiving and navigating 3D environments using echoes and an RGB image. In particular, we perform depth estimation by fusing the RGB image with echoes received from multiple orientations. Unlike previous works, we go beyond the field of view of the RGB image and estimate dense depth maps for substantially larger parts of the environment. We show that the echoes provide holistic and inexpensive information about the 3D structures, complementing the RGB image. Moreover, we study how echoes and the wide field-of-view depth maps can be utilised in robot navigation. We compare the proposed methods against recent baselines using two sets of challenging realistic 3D environments: Replica and Matterport3D. The implementation and pre-trained models will be made publicly available.
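As background on why echoes received from multiple orientations carry geometric information, the following is a minimal sketch of the textbook time-of-flight relation: a surface at distance d returns an echo after a round trip of 2d / c, where c is the speed of sound. This is not the paper's method, which learns depth from echoes and RGB with a neural network. The function names, the 343 m/s constant, and the example delays are all assumptions chosen for illustration.

```python
from typing import Dict

# Assumption: speed of sound in dry air at about 20 degrees Celsius.
SPEED_OF_SOUND_M_PER_S = 343.0


def echo_distance_m(round_trip_s: float,
                    speed: float = SPEED_OF_SOUND_M_PER_S) -> float:
    """Distance to a reflecting surface from the round-trip delay of its echo."""
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The sound travels to the surface and back, so halve the path length.
    return speed * round_trip_s / 2.0


def distances_by_orientation(delays_s: Dict[int, float]) -> Dict[int, float]:
    """Map each heading (degrees) to the distance implied by its first echo delay."""
    return {heading: echo_distance_m(t) for heading, t in delays_s.items()}


# Hypothetical first-echo delays (seconds) for four headings; not data from the paper.
example_delays = {0: 0.010, 90: 0.020, 180: 0.005, 270: 0.040}
example_distances = distances_by_orientation(example_delays)
```

Even this crude per-heading estimate covers directions outside a forward-facing camera's view, which is the intuition behind using echoes to extend depth beyond the visual field of view. Real echoes overlap and reverberate, so a single delay per heading is a strong simplification.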
