Distilling Monocular Foundation Model for Fine-grained Depth Completion
Yingping Liang, Yutao Hu, Wenqi Shao, Ying Fu
TL;DR
This work tackles depth completion from sparse LiDAR by using monocular foundation models to provide dense supervision. It introduces a two-stage distillation framework: (1) pre-training on data generated from monocular depth estimates and mesh-based LiDAR simulation, which learns geometric features from unlabeled images, and (2) fine-tuning on sparse ground-truth data with a scale- and shift-invariant loss (SSI Loss) that aligns the dense depth predictions with real-world scale. The approach achieves state-of-the-art results on the KITTI and NYUv2 benchmarks, ranking first on KITTI, and demonstrates strong generalization and detail preservation in complex outdoor scenes. By decoupling geometric feature learning from metric-scale supervision and enforcing scale-consistent alignment with monocular predictions, the method delivers dense, high-fidelity depth maps suitable for downstream 3D understanding tasks.
Abstract
Depth completion involves predicting dense depth maps from sparse LiDAR inputs. However, sparse depth annotations from sensors limit the availability of dense supervision, which is necessary for learning detailed geometric features. In this paper, we propose a two-stage knowledge distillation framework that leverages powerful monocular foundation models to provide dense supervision for depth completion. In the first stage, we introduce a pre-training strategy that generates diverse training data from natural images, distilling geometric knowledge into depth completion models. Specifically, we simulate LiDAR scans using monocular depth and mesh reconstruction, thereby creating training data without requiring ground-truth depth. Monocular depth estimation, however, suffers from inherent scale ambiguity in real-world settings. To address this, in the second stage, we employ a scale- and shift-invariant loss (SSI Loss) to learn real-world scales when fine-tuning on real-world datasets. Our two-stage distillation framework enables depth completion models to harness the strengths of monocular foundation models. Experimental results demonstrate that models trained with our two-stage distillation framework achieve state-of-the-art performance, ranking first on the KITTI benchmark. Code is available at https://github.com/Sharpiless/DMD3C
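
To make the idea of a scale- and shift-invariant loss concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one common formulation: a scale s and shift b are fitted in closed form by least squares so that the monocular depth matches the predicted depth, and the loss is the mean absolute residual after alignment. The function name `ssi_loss`, its arguments, the alignment direction, and the choice of an L1 residual are all illustrative assumptions; the exact form used in the paper may differ.

```python
import numpy as np


def ssi_loss(pred, mono, mask=None):
    """Scale- and shift-invariant L1 loss between two depth maps (sketch).

    pred: depth predicted by the completion model (any shape).
    mono: relative depth from a monocular model (same shape as pred).
    mask: optional boolean array selecting valid pixels.

    Assumption: `mono` is aligned to `pred` with a least-squares
    scale and shift before the residual is measured.
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    mono = np.asarray(mono, dtype=np.float64).ravel()
    if mask is None:
        mask = np.ones(pred.shape, dtype=bool)
    else:
        mask = np.asarray(mask, dtype=bool).ravel()

    p, m = pred[mask], mono[mask]

    # Solve min over (s, b) of sum((s * m + b - p)^2) in closed form.
    design = np.stack([m, np.ones_like(m)], axis=1)
    (s, b), *_ = np.linalg.lstsq(design, p, rcond=None)

    # Mean absolute residual after removing scale and shift.
    return float(np.mean(np.abs(s * m + b - p)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    depth = rng.uniform(1.0, 80.0, size=(4, 5))
    # An affinely distorted copy should incur (near) zero loss.
    print(ssi_loss(depth, 0.5 * depth - 2.0))
    # A non-affine distortion should not.
    print(ssi_loss(depth, depth ** 2))
```

Because the alignment absorbs any global scale and shift, a loss of this kind can only supervise relative structure; under this reading, metric scale still has to come from the sparse ground-truth depth used during fine-tuning.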
