RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
Michael Baltaxe, Dan Levi, Sagie Benaim
TL;DR
This work tackles monocular metric depth estimation (MMDE) for underrepresented classes in complex scenes. It introduces RAD, a retrieval-augmented framework that retrieves semantically similar RGB-D context to provide geometric proxies and fuses this information with the input via a dual-stream Vision Transformer and a matched cross-attention module. Training employs uncertainty-aware context sourcing and 3D augmentation, while inference uses retrieval-based context with reliable correspondences to refine depth predictions. Across NYU Depth v2, KITTI, and Cityscapes, RAD achieves strong improvements for rare classes while maintaining competitive performance on in-domain regions, demonstrating the value of targeted, geometry-guided context in long-tail MMDE.
Abstract
Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.
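To make the idea of transferring geometry "only at reliable point correspondences" concrete, the following is a minimal NumPy sketch of a masked cross-attention step. It is an illustration under stated assumptions, not the RAD implementation: the function name `matched_cross_attention`, the boolean `match_mask`, and the absence of learned query/key/value projections are all hypothetical simplifications. The abstract does not specify how correspondences are computed or how the module is parameterized.

```python
import numpy as np

def matched_cross_attention(query_tokens, context_tokens, context_values, match_mask):
    """Cross-attention restricted to matched token pairs (illustrative sketch).

    query_tokens:   (N, d) features of the input image tokens
    context_tokens: (M, d) features of the retrieved context tokens
    context_values: (M, d) geometry-carrying features of the context (assumed)
    match_mask:     (N, M) boolean, True where a correspondence is deemed reliable
    """
    d = query_tokens.shape[-1]
    # Scaled dot-product similarity between input and context tokens.
    scores = query_tokens @ context_tokens.T / np.sqrt(d)
    # Unmatched pairs are excluded from the softmax.
    scores = np.where(match_mask, scores, -np.inf)

    # Input tokens with no reliable match must not receive any context.
    has_match = match_mask.any(axis=1)
    # Give unmatched rows finite placeholder scores so the softmax is defined.
    scores = np.where(has_match[:, None], scores, 0.0)

    # Numerically stable softmax over the context axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Zero out the placeholder weights of unmatched rows.
    weights = weights * has_match[:, None]

    # Residual update: unmatched tokens pass through unchanged.
    return query_tokens + weights @ context_values
```

In this sketch, an input token with a single reliable match receives exactly that context token's value, while a token with no match is left untouched, which mirrors the stated goal of not injecting context where correspondences are unreliable.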
