Distilling Monocular Foundation Model for Fine-grained Depth Completion

Yingping Liang, Yutao Hu, Wenqi Shao, Ying Fu

TL;DR

This work tackles depth completion from sparse LiDAR by using monocular foundation models to provide dense supervision. It introduces a two-stage distillation: (1) data generation and pre-training, which uses monocular depth estimates and mesh-based LiDAR simulation to learn geometric features from unlabeled images, and (2) a scale- and shift-invariant loss (SSI Loss) that aligns the dense depth predictions with real-world scale during fine-tuning on sparse ground-truth data. The approach achieves state-of-the-art results on the KITTI and NYUv2 benchmarks, ranking first on KITTI, and demonstrates strong generalization and detail preservation in complex outdoor scenes. By decoupling geometric feature learning from metric-scale supervision and enforcing scale-consistent alignment with monocular predictions, the method delivers dense, high-fidelity depth maps suitable for downstream 3D understanding tasks.

Abstract

Depth completion involves predicting dense depth maps from sparse LiDAR inputs. However, sparse depth annotations from sensors limit the availability of dense supervision, which is necessary for learning detailed geometric features. In this paper, we propose a two-stage knowledge distillation framework that leverages powerful monocular foundation models to provide dense supervision for depth completion. In the first stage, we introduce a pre-training strategy that generates diverse training data from natural images and distills geometric knowledge into depth completion models. Specifically, we simulate LiDAR scans using monocular depth and mesh reconstruction, thereby creating training data without requiring ground-truth depth. Monocular depth estimation, however, suffers from inherent scale ambiguity in real-world settings. To address this, in the second stage, we employ a scale- and shift-invariant loss (SSI Loss) to learn real-world scales when fine-tuning on real-world datasets. Our two-stage distillation framework enables depth completion models to harness the strengths of monocular foundation models. Experimental results demonstrate that models trained with our two-stage distillation framework achieve state-of-the-art performance, ranking first on the KITTI benchmark. Code is available at https://github.com/Sharpiless/DMD3C
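
To make the first-stage idea concrete, the following is a minimal sketch of turning a dense monocular depth map into sparse, LiDAR-like input. It is not the authors' pipeline: it assumes the simulated LiDAR sits at the camera centre, which lets it skip the mesh reconstruction and ray casting described in the abstract, and the function name, beam count, angular ranges, and intrinsics are all hypothetical values chosen for illustration.

```python
import numpy as np

def simulate_sparse_lidar(depth, fx, fy, cx, cy, n_beams=16,
                          elev_range_deg=(-15.0, 5.0),
                          az_range_deg=(-40.0, 40.0), az_step_deg=0.5):
    """Sample a dense depth map along simulated LiDAR scan lines.

    Assumption: the LiDAR is co-located with a pinhole camera, so each beam
    maps to exactly one pixel and no occlusion reasoning (mesh) is needed.
    """
    h, w = depth.shape
    # One elevation angle per beam, one azimuth angle per firing step.
    elev = np.deg2rad(np.linspace(elev_range_deg[0], elev_range_deg[1], n_beams))
    az = np.deg2rad(np.arange(az_range_deg[0], az_range_deg[1] + 1e-9, az_step_deg))
    elev_g, az_g = np.meshgrid(elev, az, indexing="ij")

    # Unit ray directions in camera coordinates (x right, y down, z forward).
    x = np.cos(elev_g) * np.sin(az_g)
    y = -np.sin(elev_g)
    z = np.cos(elev_g) * np.cos(az_g)

    # Keep only rays pointing in front of the camera before dividing by z.
    front = z > 1e-6
    x, y, z = x[front], y[front], z[front]

    # Pinhole projection of each ray to the nearest pixel.
    u = np.rint(fx * x / z + cx).astype(int)
    v = np.rint(fy * y / z + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v = u[inside], v[inside]

    # Sparse depth map: zero means "no LiDAR return".
    sparse = np.zeros_like(depth)
    sparse[v, u] = depth[v, u]
    return sparse
```

Under these assumptions, a pair (RGB image, `simulate_sparse_lidar(mono_depth, ...)`) forms a training input and the dense monocular depth acts as the target, so no ground-truth depth is involved. A LiDAR placed at a different pose than the camera would need occlusion handling, which is where a reconstructed mesh becomes useful.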

Paper Structure

This paper contains 17 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Depth completion models trained solely with $L_1$ loss and sparse ground truth produce incomplete and fragmented depth predictions. Our framework, however, demonstrates significant improvements by distilling knowledge from monocular foundation models and incorporating a scale- and shift-invariant loss (SSI Loss), resulting in more complete and accurate dense depth completion.
  • Figure 2: Illustration of our proposed first distillation stage with a data generation strategy to learn geometric features from monocular foundation models, which only requires unlabeled RGB images. We use the estimated monocular depth to reconstruct the scene and then simulate the LiDAR scanning process to generate sparse points for training.
  • Figure 3: Illustration of our proposed second distillation stage, which uses foundation models for monocular depth estimation when fine-tuning on labeled datasets. Sparse ground truth provides the real-world depth scale through an $L_1$ loss. Our method enhances this process by incorporating dense monocular depth for fine-grained supervision. However, monocular depth maps come with inherent scale and shift ambiguities. To address these challenges, we employ a Scale- and Shift-Invariant Loss (SSI Loss) that aligns the predictions with the dense monocular depth to match the real-world depth scale, ensuring more accurate depth completion.
  • Figure 4: Qualitative comparison of our proposed DMD$^{3}$C with several state-of-the-art methods on the KITTI benchmark, using public test results. Error maps highlight pixels with ground truth. In regions lacking ground truth, our method demonstrates notable improvements in depth completion, even though these areas are excluded from the evaluation metrics.
  • Figure 5: Qualitative comparison of depth completion methods. This figure demonstrates the performance of various depth completion models, including CFormer, LRRU, ImprovingDC, BP-Net, and our proposed DMD$^{3}$C. For each method, we show the input RGB images with sparse LiDAR points (left), along with the resulting completed depth maps and corresponding 3D point cloud reconstructions.
  • ...and 1 more figures
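
The second-stage objective described in the Figure 3 caption can be sketched as follows. This is one common way to build a scale- and shift-invariant term (a closed-form least-squares fit of scale and shift, in the style of MiDaS), not necessarily the exact formulation used in the paper. The direction of alignment (monocular depth fitted to the prediction), the weight `ssi_weight`, and all function names are assumptions made for illustration.

```python
import numpy as np

def fit_scale_shift(src, dst):
    """Least-squares scale s and shift t minimizing ||s * src + t - dst||^2."""
    A = np.stack([src.ravel(), np.ones(src.size)], axis=1)
    coef, *_ = np.linalg.lstsq(A, dst.ravel(), rcond=None)
    return float(coef[0]), float(coef[1])

def ssi_loss(pred, mono):
    """Dense term: compare pred with mono after removing scale and shift.

    Assumption: the monocular depth is aligned to the prediction.
    """
    s, t = fit_scale_shift(mono, pred)
    return float(np.mean(np.abs(s * mono + t - pred)))

def sparse_l1_loss(pred, sparse_gt):
    """Metric term: L1 error on pixels that have a LiDAR ground-truth value."""
    valid = sparse_gt > 0  # zero marks pixels without ground truth
    if not valid.any():
        return 0.0
    return float(np.mean(np.abs(pred[valid] - sparse_gt[valid])))

def finetune_loss(pred, sparse_gt, mono, ssi_weight=1.0):
    """Sparse L1 anchors the metric scale; SSI adds dense, scale-free detail."""
    return sparse_l1_loss(pred, sparse_gt) + ssi_weight * ssi_loss(pred, mono)
```

Because the SSI term is unchanged by any affine rescaling of the monocular depth, it can only supervise relative structure; the sparse $L_1$ term is what ties the prediction to real-world scale. In actual training these functions would be written in an automatic-differentiation framework so that gradients flow to the prediction.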