Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator

Xiankang He, Dongyan Guo, Hongji Li, Ruibo Li, Ying Cui, Chi Zhang

TL;DR

The paper tackles noise in pseudo-label distillation for zero-shot monocular depth estimation by critically analyzing depth normalization and introducing Cross-Context Distillation to fuse local detail with global structure. It further strengthens supervision with an assistant-guided approach that leverages diffusion-prior depth from a secondary teacher, formalized with losses like $L_{\text{Dis}}$, $L_{\text{sc}}$, and $L_{\text{lg}}$ and weights $\lambda_1$, $\lambda_2$, $\lambda_3$. Extensive experiments on standard benchmarks show state-of-the-art zero-shot performance across diverse scenes and architectures, validating improved pseudo-label reliability and generalization. The approach advances practical MDE by delivering finer depth details, better global consistency, and data-efficient training, enabling robust depth estimation in-the-wild and cross-domain scenarios.

Abstract

Recent advances in zero-shot monocular depth estimation (MDE) have significantly improved generalization by unifying depth distributions through normalized depth representations and by leveraging large-scale unlabeled data via pseudo-label distillation. However, existing methods that rely on global depth normalization treat all depth values equally, which can amplify noise in pseudo-labels and reduce distillation effectiveness. In this paper, we present a systematic analysis of depth normalization strategies in the context of pseudo-label distillation. Our study shows that, under recent distillation paradigms (e.g., shared-context distillation), normalization is not always necessary, as omitting it can help mitigate the impact of noisy supervision. Furthermore, rather than focusing solely on how depth information is represented, we propose Cross-Context Distillation, which integrates both global and local depth cues to enhance pseudo-label quality. We also introduce an assistant-guided distillation strategy that incorporates complementary depth priors from a diffusion-based teacher model, enhancing supervision diversity and robustness. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, both quantitatively and qualitatively.
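
To make the concern about global normalization concrete, the sketch below normalizes a flattened pseudo-label with a median shift and a mean-absolute-deviation scale, the scale-and-shift-invariant form commonly used in MiDaS-style training. The function name `ssi_normalize` and the toy depth values are illustrative assumptions, not the authors' code. It shows that one noisy pixel changes the scale, and therefore the normalized target of every other pixel.

```python
from statistics import median


def ssi_normalize(depth):
    """Normalize a flat list of depth values with one global shift and scale.

    shift = median of all values
    scale = mean absolute deviation from that median
    (Assumed form of global normalization; hypothetical helper.)
    """
    shift = median(depth)
    scale = sum(abs(d - shift) for d in depth) / len(depth)
    return [(d - shift) / scale for d in depth]


# Toy pseudo-labels: identical except for one noisy pixel at the end.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
noisy = [1.0, 2.0, 3.0, 4.0, 50.0]

norm_clean = ssi_normalize(clean)
norm_noisy = ssi_normalize(noisy)

# The first pixel has the same raw depth in both labels, but its
# normalized target shrinks once the outlier inflates the global scale.
print(norm_clean[0], norm_noisy[0])
```

In this toy case the raw value of the first pixel never changes, yet its supervision target does. Restricting the normalization region, or skipping normalization when teacher and student see the same input, keeps such noise from spreading across the whole image.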

Paper Structure

This paper contains 23 sections, 11 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Zero-shot prediction on in-the-wild images. Our model, distilled from Genpercept [xu2024diffusion] and DepthAnythingv2 [depth_anything_v2], recovers more accurate depth details and generalizes better than other monocular depth estimation methods on in-the-wild images.
  • Figure 2: Issue with Global Normalization (SSI). In (a), we compare two alignment strategies for the central $w/2 \times h/2$ region: (1) Global Least-Square, where alignment is applied to the full image before cropping, and (2) Local Least-Square, where alignment is performed on the cropped region. Metrics are computed on the cropped region. As shown in (b), the local strategy performs better, which demonstrates that global normalization degrades local accuracy compared to local normalization.
  • Figure 3: Overview of Cross-Context Distillation. Our method combines local and global depth information to enhance the student model’s predictions. It includes two scenarios: (1) Shared-Context Distillation, where both models use the same image for distillation; and (2) Local-Global Distillation, where the teacher predicts depth for overlapping patches while the student predicts the full image. The Local-Global loss $\mathcal{L}_{\text{lg}}$ (Top Right) ensures consistency between local and global predictions, enabling the student to learn both fine details and broad structures, improving accuracy and robustness.
  • Figure 4: Normalization Strategies. We compare four normalization strategies: Global Norm [ranftl2020midas], Hybrid Norm [zhang2022hdn], Local Norm, and No Norm. The figure visualizes how each strategy processes pixels within the normalization region (Norm. Area). The red dot represents any pixel within the region.
  • Figure 5: Different Inputs Lead to Different Pseudo Labels. Global Depth: The teacher model predicts depth using the entire image, and the local region's prediction is cropped from the output. Local Depth: The teacher model directly takes the cropped local region as input, which yields finer depth detail for that area than cropping from the full-image prediction.
  • ...and 6 more figures
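
The comparison described in Figure 2 can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the authors' evaluation code: the helper names are hypothetical, depth maps are plain nested lists, AbsRel is assumed as the metric, and the 4x4 toy data are invented so that the prediction is exactly affine to the ground truth inside the central crop but not at the border.

```python
def fit_scale_shift(pred, gt):
    """Closed-form least squares for s, t minimizing sum((s*p + t - g)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov_pg = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov_pg / var_p
    t = mean_g - s * mean_p
    return s, t


def abs_rel(pred, gt):
    """Mean absolute relative error (assumed metric for this sketch)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def flatten(img):
    return [v for row in img for v in row]


def center_crop(img):
    """Return the central (h/2) x (w/2) region as a flat list."""
    h, w = len(img), len(img[0])
    rows = img[h // 4 : h // 4 + h // 2]
    return [v for row in rows for v in row[w // 4 : w // 4 + w // 2]]


def compare_alignments(pred, gt):
    """Return (global_err, local_err), both measured on the central crop."""
    crop_p, crop_g = center_crop(pred), center_crop(gt)

    # Global Least-Square: fit on the full image, then evaluate the crop.
    s_g, t_g = fit_scale_shift(flatten(pred), flatten(gt))
    global_err = abs_rel([s_g * p + t_g for p in crop_p], crop_g)

    # Local Least-Square: fit directly on the cropped region.
    s_l, t_l = fit_scale_shift(crop_p, crop_g)
    local_err = abs_rel([s_l * p + t_l for p in crop_p], crop_g)
    return global_err, local_err


# Toy data: inside the crop gt = 2 * pred + 2; the border breaks that relation.
gt = [
    [2.0, 2.0, 2.0, 2.0],
    [2.0, 4.0, 6.0, 2.0],
    [2.0, 8.0, 10.0, 2.0],
    [2.0, 2.0, 2.0, 2.0],
]
pred = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 2.0, 1.0],
    [1.0, 3.0, 4.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

global_err, local_err = compare_alignments(pred, gt)
print(f"global AbsRel on crop: {global_err:.3f}")
print(f"local  AbsRel on crop: {local_err:.3f}")
```

Because the global fit must also account for the border pixels, its scale and shift are a compromise that leaves residual error in the crop, while the local fit recovers the crop's relation directly. Real depth maps will not be this clean, so the gap between the two errors depends on the data.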