RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes

Michael Baltaxe, Dan Levi, Sagie Benaim

TL;DR

This work tackles monocular metric depth estimation (MMDE) for underrepresented classes in complex scenes. It introduces RAD, a retrieval-augmented framework that retrieves semantically similar RGB-D context to provide geometric proxies and fuses this information with the input via a dual-stream Vision Transformer and a matched cross-attention module. Training employs uncertainty-aware context sourcing and 3D augmentation, while inference uses retrieval-based context with reliable correspondences to refine depth predictions. Across NYU Depth v2, KITTI, and Cityscapes, RAD achieves strong improvements for rare classes while maintaining competitive performance on in-domain regions, demonstrating the value of targeted, geometry-guided context in long-tail MMDE.

Abstract

Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.
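
The uncertainty-aware retrieval step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-segment uncertainty scores, the threshold `tau`, the toy descriptor vectors, and the function names `uncertain_segments` and `retrieve` are all hypothetical. In a real pipeline the uncertainty would come from a depth model and the descriptors from an image encoder.

```python
import math

def uncertain_segments(segment_uncertainty, tau):
    """Return the ids of segments whose depth uncertainty exceeds tau (assumed threshold)."""
    return [sid for sid, u in segment_uncertainty.items() if u > tau]

def cosine(a, b):
    """Cosine similarity between two descriptor vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def retrieve(query_descriptor, database, k=2):
    """Return the names of the k database entries most similar to the query."""
    ranked = sorted(
        database.items(),
        key=lambda item: cosine(query_descriptor, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Toy data (hypothetical): one uncertainty score per segment of the input image.
segment_uncertainty = {"wall": 0.05, "candle": 0.62, "table": 0.10}
kept = uncertain_segments(segment_uncertainty, tau=0.3)

# Hypothetical descriptors of RGB-D context samples in the retrieval set.
database = {
    "ctx_candle": [0.9, 0.1, 0.0],
    "ctx_sofa": [0.0, 1.0, 0.2],
    "ctx_lamp": [0.7, 0.3, 0.1],
}
# Stand-in descriptor of the input image with all but the `kept` segments masked out.
query = [1.0, 0.0, 0.0]
neighbors = retrieve(query, database, k=2)
print(kept, neighbors)
```

Masking before retrieval is the design choice that matters here: because only the uncertain segments contribute to the query descriptor, the nearest neighbors are chosen for the rare object rather than for the overall scene.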

Paper Structure (27 sections, 4 equations, 13 figures, 5 tables)

This paper contains 27 sections, 4 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Illustration. Given an input image, RAD (using a DepthAnything v2 backbone [DepthAnything_V2]) retrieves context views for highly uncertain objects of underrepresented classes (e.g., candles) to serve as structural geometric proxies. These are used in a dual-stream network to output an accurate monocular metric depth estimate, fixing uncertain regions relative to the direct DepthAnything v2 baseline.
  • Figure 2: RAD Pipeline. Given an input image, a set of context samples is sourced (Sec. \ref{sec:point matching}) using either uncertainty-aware image retrieval (at both training and inference) or 3D augmentation (training only). Subsequently, spatial correspondences are established (Sec. \ref{sec:point matching}). These are used to infer depth via a dual-stream depth estimation network employing matched cross-attention (Sec. \ref{sec:depth estimation network}). Blue blocks indicate components used for both training and inference, while the green block is used only for training.
  • Figure 3: Uncertainty-aware retrieval flow. Pixel-wise depth uncertainty is calculated in parallel with image segmentation. We use both to keep only the highly uncertain segments, masking the rest of the image. Given the masked image, we retrieve relevant examples from the context/training set using DINO descriptors.
  • Figure 4: Matched Cross-Attention. (a) illustrates the modified attention architecture designed to enable effective information transfer from the context stream to the input stream. For each token $j$ in the input image, with query vector $Q_i[j]$, attention is computed using key and value matrices formed by concatenating the input's keys ($K_i$) and values ($V_i$) with the matched context keys ($K_m(j)$) and values ($V_m(j)$). These matched matrices are constructed by selecting $j$'s matching context tokens from the full context matrices $K_c$ and $V_c$, respectively. (b) shows that matching tokens are defined as those located within a spatial neighborhood surrounding the matched point of $j$ in the context image.
  • Figure 5: Qualitative results for NYU Depth v2 (top two rows), KITTI (middle two rows), and Cityscapes (bottom two rows). We compare our method (RAD) to the baselines DepthAnything v2 [DepthAnything_V2], UniDepth v2 [UniDepth_V2], and Metric3D v2 [metric3d_v2]. Best viewed zoomed in.
  • ...and 8 more figures
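
The matched cross-attention described in the Figure 4 caption can be sketched as below. This is a simplified, single-head illustration under stated assumptions, not the paper's implementation: it omits learned projections and multi-head structure, and the `matches` dictionary (input token index to a list of context token indices in the neighborhood of its matched point) is a hypothetical stand-in for the output of the correspondence step. Tokens without a reliable correspondence simply fall back to ordinary self-attention over the input stream.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matched_cross_attention(Q_i, K_i, V_i, K_c, V_c, matches):
    """
    Q_i, K_i, V_i: per-token query/key/value vectors of the input stream.
    K_c, V_c:      per-token key/value vectors of the context stream.
    matches:       dict mapping input token j to the context token indices
                   matched to j (hypothetical format; absent = no match).
    """
    d = len(Q_i[0])
    out = []
    for j, q in enumerate(Q_i):
        idx = matches.get(j, [])
        # Concatenate the input keys/values with token j's matched context
        # keys/values, i.e. K_m(j) and V_m(j) selected from K_c and V_c.
        keys = K_i + [K_c[t] for t in idx]
        values = V_i + [V_c[t] for t in idx]
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        dim_v = len(values[0])
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(dim_v)])
    return out

# Toy example: two input tokens, one context token matched to input token 0 only.
Q_i = [[1.0, 0.0], [0.0, 1.0]]
K_i = [[1.0, 0.0], [0.0, 1.0]]
V_i = [[1.0, 0.0], [0.0, 1.0]]
K_c = [[2.0, 0.0]]
V_c = [[5.0, 5.0]]

out_matched = matched_cross_attention(Q_i, K_i, V_i, K_c, V_c, {0: [0]})
out_plain = matched_cross_attention(Q_i, K_i, V_i, K_c, V_c, {})
```

In this toy setup, token 1 has no match, so its output is identical in both calls, while token 0 is pulled toward the context value because the matched context key participates in its softmax.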