Estimating Indoor Scene Depth Maps from Ultrasonic Echoes

Junpei Honma, Akisato Kimura, Go Irie

TL;DR

This paper studies indoor depth estimation from inaudible ultrasonic echoes and tackles the loss of accuracy that occurs when the sound source is restricted to ultrasonic frequencies. It introduces a training-time knowledge transfer method that uses audible echoes as auxiliary data through a spectrogram Mixup augmentation, so that only ultrasonic echoes are needed at test time. Validated on the Replica dataset, the approach achieves lower RMSE than the baselines and produces depth maps that are visually closer to the ground truth. This makes ultrasonic depth sensing more practical in environments where audible sounds are undesirable or prohibited, broadening the applicability of echo-based depth estimation.

Abstract

Measuring 3D geometric structures of indoor scenes requires dedicated depth sensors, which are not always available. Echo-based depth estimation has recently been studied as a promising alternative solution. All previous studies have assumed the use of echoes in the audible range. However, one major problem is that audible echoes cannot be used in quiet spaces or other situations where producing audible sounds is prohibited. In this paper, we consider echo-based depth estimation using inaudible ultrasonic echoes. While ultrasonic waves provide high measurement accuracy in theory, the actual depth estimation accuracy when ultrasonic echoes are used has remained unclear, because ultrasonic waves are sensitive to noise and susceptible to attenuation. We first investigate the depth estimation accuracy when the frequency of the sound source is restricted to the high-frequency band, and find that the accuracy decreases when the frequency is limited to the ultrasonic range. Based on this observation, we propose a novel deep learning method that improves the accuracy of ultrasonic echo-based depth estimation by using audible echoes as auxiliary data only during training. Experimental results on a public dataset demonstrate that our method improves the estimation accuracy.

Paper Structure (12 sections, 3 equations, 6 figures)

This paper contains 12 sections, 3 equations, and 6 figures.

Figures (6)

  • Figure 1: Overview of Our Idea. Top: All existing echo-based depth estimation methods use audible echo spectrograms during both training and testing, but audible echoes cannot always be used, depending on the surrounding conditions of the target scene. Bottom: In this paper, we aim to mitigate this problem by using inaudible ultrasonic echo spectrograms during testing. Based on the observation that a straightforward approach that restricts the frequency band to the ultrasonic range leads to poor depth estimation accuracy, we propose an approach that uses audible echoes as auxiliary data only during training. We confirm that our method improves the depth estimation accuracy in terms of root mean squared error (RMSE) between the estimated and the ground truth depth maps.
  • Figure 2: Echo-based Depth Estimation Framework. A known chirp signal is emitted to the indoor scene and spectrograms of the multi-channel echoes from the microphone array are extracted. The features are fed into a convolutional neural network (CNN) to estimate the depth map of the scene. The CNN is trained to minimize the RMSE between the estimated and ground truth depth maps.
  • Figure 3: Results of Preliminary Experiments. RMSE values for all frequency setups (lower is better). The blue and orange bars indicate the results using audible and ultrasonic sound sources, respectively.
  • Figure 4: Our Method. An augmented echo is generated by fusing an ultrasonic echo with an audible echo from a lower frequency band. The network is trained to minimize the weighted sum of two losses, one for the depth map estimated from the ultrasonic echo and one for the depth map estimated from the augmented echo. The weight $\lambda$ is scheduled as training proceeds.
  • Figure 5: Quantitative Results. RMSE values of the ultrasonic echo only, augmented echo only, and the proposed method (lower is better).
  • ...and 1 more figures
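
The training scheme described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the convex-combination form of the fusion, the linear schedule for the weight, and the choice of which loss term the weight multiplies are all hypothetical, since the captions do not specify them. Only the overall idea (fuse an ultrasonic and an audible spectrogram, then minimize a scheduled weighted sum of two RMSE losses) comes from the text above.

```python
import numpy as np


def mixup_spectrograms(ultrasonic, audible, alpha):
    """Fuse two spectrograms of identical shape into one augmented spectrogram.

    ASSUMPTION: a Mixup-style convex combination, where `alpha` in [0, 1]
    is the share given to the ultrasonic spectrogram.
    """
    if ultrasonic.shape != audible.shape:
        raise ValueError("spectrograms must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * ultrasonic + (1.0 - alpha) * audible


def rmse(pred_depth, true_depth):
    """Root mean squared error between an estimated and a ground truth depth map."""
    return float(np.sqrt(np.mean((pred_depth - true_depth) ** 2)))


def lambda_schedule(epoch, num_epochs):
    """Hypothetical linear ramp of the loss weight from 0 to 1 over training.

    ASSUMPTION: the source only says the weight is scheduled as learning
    proceeds; the actual schedule may differ.
    """
    return min(1.0, epoch / max(1, num_epochs - 1))


def combined_loss(pred_from_ultrasonic, pred_from_augmented, true_depth, lam):
    """Weighted sum of the two RMSE losses.

    ASSUMPTION: `lam` weights the ultrasonic-only term and (1 - lam) the
    augmented term, so training gradually shifts toward the test-time input.
    """
    loss_ultra = rmse(pred_from_ultrasonic, true_depth)
    loss_aug = rmse(pred_from_augmented, true_depth)
    return lam * loss_ultra + (1.0 - lam) * loss_aug


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for (channels, frequency bins, time frames) spectrograms.
    ultra = rng.random((2, 64, 32))
    audible = rng.random((2, 64, 32))
    augmented = mixup_spectrograms(ultra, audible, alpha=rng.beta(1.0, 1.0))

    # Toy stand-ins for CNN outputs; a real model would map spectrograms to depth.
    true_depth = rng.random((16, 16))
    pred_u = true_depth + 0.1 * rng.standard_normal((16, 16))
    pred_a = true_depth + 0.05 * rng.standard_normal((16, 16))

    for epoch in (0, 5, 9):
        lam = lambda_schedule(epoch, num_epochs=10)
        print(epoch, lam, combined_loss(pred_u, pred_a, true_depth, lam))
```

In this sketch the augmented branch dominates early in training and the ultrasonic-only branch dominates at the end, which matches the goal of needing only ultrasonic echoes at test time; the opposite weighting or a non-linear schedule would fit the caption equally well.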