Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders

Paul Koch, Jörg Krüger, Ankit Chowdhury, Oliver Heimann

TL;DR

Generalized metric depth understanding is crucial for robotics but existing RGB encoders lack metric-depth integration. Vanishing Depth (VD) introduces a self-supervised depth adapter with Positional Depth Encoding (PDE), randomized depth distributions, and a multi-scale balanced loss to extract and align depth features within frozen RGB encoders. VD achieves state-of-the-art performance on SUN-RGBD segmentation (56.05 mIoU), Void depth completion, and competitive 6D pose estimation without finetuning, with PDE offering improved depth precision and stability over norm-based encodings. This approach enables fast, modular RGBD perception across diverse depth distributions and densities, advancing practical deployment in multi-agent robotic systems.

Abstract

Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision encoders do not support. To address this, we propose Vanishing Depth, a self-supervised training approach that extends pretrained RGB encoders to incorporate and align metric depth into their feature embeddings. Based on our novel positional depth encoding, we enable stable feature extraction that is invariant to depth density and depth distribution. We achieve performance improvements and SOTA results across a spectrum of relevant RGBD downstream tasks, without the need to finetune the encoder. Most notably, we achieve 56.05 mIoU on SUN-RGBD segmentation, 88.3 RMSE on Void's depth completion, and 83.8 top-1 accuracy on NYUv2 scene classification. In 6D object pose estimation, we outperform our predecessors DINOv2, EVA-02, and Omnivore and achieve SOTA results for non-finetuned encoders in several related RGBD downstream tasks.

Paper Structure

This paper contains 28 sections, 5 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Vanishing Depth: We use Perlin and random noise to remove depth information from the original depth image. The remaining depth information is then passed through an RGBD encoder and an FPN decoder network. Using multi-scale noise masks, we calculate and evenly combine the scale-invariant loss for reconstructing the input depth and predicting missing depth inputs at multiple downstream stages of the network. After training, we remove the decoder, resulting in a pretrained RGBD encoder.
  • Figure 2: Visualisation of Positional Depth Encoding (PDE): Example of PDE with 32 channels (16 cosine and sine frequency pairs) and max depth $max_d = 15\,\text{m}$. The first frequency (c) captures large changes in the depth image, while the final frequency (f) encodes minor depth changes.
  • Figure 3: Visualization of Embeddings: Following DINOv2 we use PCA to reduce embedded attention maps to three channels (RGB).
  • Figure 4: Noise-Shift: The threshold distribution for uniform and Perlin noise is shifted from "easy" to "hard" instead of being constant. During the warm-up period, the distribution is set to an "easy" setup to prioritize depth reconstruction learning. Subsequently, the distribution gradually shifts to a "harder" setup that demands both depth reconstruction and prediction skills.
  • Figure 5: Perlin noise: For the in-painting masks, we generate Perlin noise and convert it to binary masks given some target distribution. This lets us remove rounded areas (organic shapes) from the input signal. Our experiments show that this in turn helps the model reach better results.
  • ...and 7 more figures
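
The Figure 2 caption describes PDE as 16 sine and cosine frequency pairs (32 channels) over a maximum depth of 15 m, with the first frequency tracking large depth changes and the last tracking minor ones. The snippet below is a minimal sketch of that idea for a single depth value, not the authors' implementation. The function name `positional_depth_encoding`, the clamping to the valid range, and the frequency schedule (doubling with each pair, as in transformer-style positional encodings) are all assumptions made for illustration; the paper's exact formulation may differ.

```python
import math


def positional_depth_encoding(depth_m, num_channels=32, max_depth_m=15.0):
    """Encode one metric depth value as sine/cosine pairs (illustrative sketch).

    Hypothetical helper: the frequency schedule below is an assumption,
    not the formulation used in the paper.
    """
    if num_channels % 2 != 0:
        raise ValueError("num_channels must be even (sine/cosine pairs)")
    num_pairs = num_channels // 2

    # Clamp to [0, max_depth_m] and normalise to [0, 1] so that
    # out-of-range readings do not wrap around.
    d = min(max(depth_m, 0.0), max_depth_m) / max_depth_m

    features = []
    for k in range(num_pairs):
        # Assumed schedule: the frequency doubles with each pair, so early
        # pairs vary slowly over the depth range (coarse structure) and
        # late pairs vary quickly (fine depth differences).
        freq = math.pi * (2.0 ** k)
        features.append(math.sin(freq * d))
        features.append(math.cos(freq * d))
    return features


# Usage: one 32-channel feature vector per depth pixel.
encoding = positional_depth_encoding(2.5)
```

Because every channel stays within [-1, 1] regardless of the metric depth, an encoding of this kind avoids the scale dependence of feeding normalised raw depth directly, which is one plausible reading of the stability claim in the TL;DR.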