Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth

Kyumin Hwang, Wonhyeok Choi, Kiljoon Han, Wonjoon Choi, Minwoo Choi, Yongcheon Na, Minwoo Park, Sunghoon Im

TL;DR

This work enables real-time full-surround monocular depth estimation by transferring robust, scale-invariant depth knowledge from a foundation teacher to a lightweight FSMDE student via Cross-interaction Knowledge Distillation (CKD) and View-relational Knowledge Distillation (VRKD). The approach leverages a shared depth binning module and distills depth bin probabilities and inter-view relations to achieve metric-depth accuracy across all surround cameras. Empirical results on DDAD and nuScenes show consistent gains over supervised baselines and prior KD methods, with strong performance under real-time constraints and detailed ablations confirming the complementary benefits of CKD and VRKD. The framework demonstrates practical applicability for autonomous driving, bridging the gap between foundation-model generalization and efficient, accurate FSMDE deployment.

Abstract

Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining the knowledge distillation scheme (traditionally used in classification) with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.
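The split described in the abstract, where bin probabilities are distilled from the teacher and metric bin centers are learned from ground truth, can be illustrated for a single pixel. The sketch below is not the paper's formulation. It assumes a generic binning layout in which normalized bin widths are mapped to centers inside a depth range and depth is the probability-weighted sum of those centers. The function names, the KL divergence as distillation term, the L1 metric term, the depth range, and the loss weight are all hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bin_centers(widths, d_min, d_max):
    """Map normalized bin widths (summing to 1) to metric centers in [d_min, d_max]."""
    centers, left = [], d_min
    for w in widths:
        span = w * (d_max - d_min)
        centers.append(left + 0.5 * span)
        left += span
    return centers

def expected_depth(probs, centers):
    """Hybrid regression: depth is the probability-weighted sum of bin centers."""
    return sum(p * c for p, c in zip(probs, centers))

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over depth bins; bin centers are never involved."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher_probs, student_probs))

# One pixel, four bins. All numbers are made up for illustration.
teacher_probs = softmax([0.2, 2.0, 0.5, -1.0])
student_probs = softmax([0.0, 1.0, 1.0, -0.5])
student_centers = bin_centers([0.1, 0.2, 0.3, 0.4], d_min=1.0, d_max=81.0)

gt_depth = 20.0  # hypothetical ground-truth depth in meters
pred_depth = expected_depth(student_probs, student_centers)

distill_loss = kl_divergence(teacher_probs, student_probs)  # probability-level term
metric_loss = abs(pred_depth - gt_depth)                    # metric-scale term (assumed L1)
total_loss = metric_loss + 1.0 * distill_loss               # weight 1.0 is arbitrary
```

Because the distillation term compares only probabilities over bins, it is unaffected by the absolute scale of the teacher's depth output. In this sketch, metric scale enters solely through the student's bin centers and the ground-truth term.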

Paper Structure

This paper contains 23 sections, 8 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Inference speed vs. RMSE trade-off curve on the nuScenes dataset (upper-right is optimal). Each circle represents a model size, and the number inside each circle indicates the number of model parameters. (DA: DepthAnything [yang2024depth]; MDP2: Monodepth2 [godard2019digging]; HRD: HRDepth [lyu2021hr]; MViT: MonoViT [zhao2022monovit])
  • Figure 2: Conceptual illustration of our method. (a) Leveraging an effective depth binning module from supervised methods, we perform scale-invariant distillation at the probability level, avoiding the scale sensitivity of output-level distillation. (b) We use a potential function between adjacent views to distill relational information.
  • Figure 3: Illustration of the proposed knowledge distillation schemes. Our method leverages a depth binning module with the same architecture as the teacher model, enabling effective knowledge distillation at the scale-invariant depth bin probability level.
  • Figure 4: Qualitative results of fine-tuned Monodepth2 (denoted as Base) and Monodepth2 + Ours (denoted as Ours) on the DDAD dataset.
  • Figure 5: Additional qualitative results of fine-tuned MonoViT (denoted as Base) and MonoViT + Ours (denoted as Ours) on the DDAD dataset. The second and third rows of each sample show the depth predictions, and the last two rows present the error maps.
  • ...and 1 more figure