AVS-Net: Audio-Visual Scale Net for Self-supervised Monocular Metric Depth Estimation

Xiaohu Liu, Sascha Hornauer, Fabien Moutarde, Jialiang Lu

TL;DR

The paper tackles scale ambiguity in monocular depth estimation by introducing AVS-Net, which fuses ego-centric Echoes with RGB to extract scale information and produce scale-correct metric depth. It decomposes depth into a relative component learned via self-supervised or zero-shot methods and a scale component derived from Echoes, using a two-stage process that first generates a pseudo-dense metric depth and then applies a median-based scale correction. Evaluations on BatVision BV2 and BV1 show that Echoes-enhanced predictions outperform visual-only baselines and improve both relative-depth models and zero-shot metric-depth systems, validating Echoes as a scalable, plug-and-play scale source. The approach offers practical gains in generalization and applicability across audio-visual scenes and can be integrated with diverse depth-estimation models to achieve metric-depth predictions without extensive supervised data.
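The median-based scale correction mentioned above can be illustrated with a minimal sketch. It assumes the scale factor is the ratio between the median of the pseudo-dense metric depth and the median of the relative depth; the paper's exact formulation may differ. The function name `median_scale_correct` and the toy depth values are hypothetical and chosen only for illustration.

```python
from statistics import median


def median_scale_correct(relative_depth, pseudo_metric_depth):
    """Rescale a relative depth map so its median matches a metric reference.

    Both arguments are 2D lists (rows of floats) with the same number of pixels.
    Assumption: scale = median(reference) / median(relative). This is a common
    median-scaling recipe, not necessarily the exact rule used by AVS-Net.
    """
    rel_values = [v for row in relative_depth for v in row]
    ref_values = [v for row in pseudo_metric_depth for v in row]
    if len(rel_values) != len(ref_values):
        raise ValueError("depth maps must contain the same number of pixels")

    rel_median = median(rel_values)
    if rel_median <= 0:
        raise ValueError("median of the relative depth must be positive")

    # One global scale factor for the whole image.
    scale = median(ref_values) / rel_median

    # The relative map keeps its structure; only its global scale changes.
    scaled = [[scale * v for v in row] for row in relative_depth]
    return scaled, scale


if __name__ == "__main__":
    relative = [[0.5, 1.0], [1.5, 2.0]]   # up-to-scale depth (toy values)
    pseudo = [[2.0, 4.5], [5.5, 9.0]]     # noisy metric reference (toy values)
    scaled, scale = median_scale_correct(relative, pseudo)
    print(scale)
    print(scaled)
```

Because only a single global factor is taken from the pseudo-dense map, its per-pixel noise does not propagate into the final prediction; the fine structure comes entirely from the relative depth map.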

Abstract

Metric depth prediction from monocular videos suffers from poor generalization between datasets and requires supervised depth data for scale-correct training. Self-supervised training using multi-view reconstruction can benefit from large-scale natural videos but does not provide correct scale, limiting its benefits. Recently, reflecting audible Echoes off objects has been investigated for improved depth prediction and was shown to be sufficient to reconstruct objects at scale even without a visual signal. Because Echoes travel at a fixed speed, they have the potential to resolve ambiguities in object scale and appearance. However, predicting depth end-to-end from sound and vision cannot benefit from unsupervised depth prediction approaches, which can process large-scale data without sound annotation. In this work we show how Echoes can benefit depth prediction in two ways: when learning metric depth from supervised data, and as a supervisory signal for scale-correct self-supervised training. We show how we can improve the predictions of several state-of-the-art approaches and how the method can scale-correct a self-supervised depth approach.
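
The reason a fixed propagation speed carries scale can be made concrete with the textbook time-of-flight relation. This is a worked illustration under stated assumptions, not a formula taken from the paper: the symbols $d$, $c$, and $\Delta t$ are introduced here, and $c \approx 343$ m/s is the speed of sound in air at roughly 20 °C.

```latex
% Round-trip time of flight: the echo travels to the surface and back,
% so the one-way distance is half of the distance covered in \Delta t.
d = \frac{c \,\Delta t}{2}

% Worked example (assumed values): an echo arriving 20 ms after emission.
d = \frac{343\ \mathrm{m/s} \times 0.020\ \mathrm{s}}{2} = 3.43\ \mathrm{m}
```

An image alone cannot distinguish a small nearby object from a large distant one, whereas the echo delay is tied to an absolute distance in meters.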

Paper Structure

This paper contains 17 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Visual-only scaling can produce erroneous results (middle row), whereas Echoes, which inherently contain scale information, can help resolve such scale ambiguity (bottom row).
  • Figure 2: Comparison between the pseudo-dense metric depth and the final scale-corrected depth (proposed). Although the former provides globally correct scale information, its depth quality degrades because of over-fitting to low-quality ground truth. The latter benefits from both correct scale and superior depth quality.
  • Figure 3: Illustration of the proposed AVS-Net (upper two rows), which takes binaural STFT and RGB as input. The audio-visual latent vectors are fused by a multi-head cross-attention module to estimate metric bin centers and a pseudo-dense metric depth. Scale factors are extracted from the latter and combined with a relative depth map from a pre-trained relative depth model to form the final metric depth.
  • Figure 4: Illustration of the self-supervised relative depth estimation method, inspired by multi-view approaches such as godard2019digging and zhou2017unsupervised.
  • Figure 5: Qualitative examples, where "Proposed" denotes RGB-Echoes scaling. Examples a) to f) are based on Monodepth2, MonoVit, LiteMono, ZoeDepth, NeWCRFs, and Jun et al., respectively. From left to right: input RGB image, ground-truth depth, estimated depth without scaling, estimated depth with visual-only scaling (AVS-Net(only-RGB)), estimated depth with RGB-Echoes scaling (AVS-Net(RGB-Echoes)), absolute difference between Only Visual Scaling (OVS) and ground truth, and absolute difference between the proposed method and ground truth.
  • ...and 1 more figure