Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders
Paul Koch, Jörg Krüger, Ankit Chowdhury, Oliver Heimann
TL;DR
Generalized metric depth understanding is crucial for robotics, but existing RGB encoders lack metric-depth integration. Vanishing Depth (VD) introduces a self-supervised depth adapter that uses Positional Depth Encoding (PDE), randomized depth distributions, and a multi-scale balanced loss to extract depth features and align them with the features of frozen RGB encoders. Without finetuning the encoder, VD achieves state-of-the-art results on SUN-RGBD segmentation (56.05 mIoU) and Void depth completion, and competitive results on 6D pose estimation. PDE offers improved depth precision and stability over norm-based encodings. This approach enables fast, modular RGBD perception across diverse depth distributions and densities, advancing practical deployment in multi-agent robotic systems.
Abstract
Generalized metric depth understanding is critical for precise vision-guided robotics, yet current state-of-the-art (SOTA) vision encoders do not support it. To address this, we propose Vanishing Depth, a self-supervised training approach that extends pretrained RGB encoders to incorporate metric depth and align it with their feature embeddings. Building on our novel positional depth encoding, we enable stable feature extraction that is invariant to depth density and depth distribution. We achieve performance improvements and SOTA results across a spectrum of relevant RGBD downstream tasks, without the need to finetune the encoder. Most notably, we achieve 56.05 mIoU on SUN-RGBD segmentation, 88.3 RMSE on Void's depth completion, and 83.8 Top-1 accuracy on NYUv2 scene classification. In 6D object pose estimation, we outperform our predecessors DinoV2, EVA-02, and Omnivore, and we achieve SOTA results for non-finetuned encoders in several related RGBD downstream tasks.
