NimbleD: Enhancing Self-supervised Monocular Depth Estimation with Pseudo-labels and Large-scale Video Pre-training

Albert Luginov; Muhammad Shahzad

TL;DR

NimbleD addresses the need for fast, low-latency monocular depth estimation suitable for real-time metaverse applications by pairing self-supervision with pseudo-label supervision from a large vision model and large-scale video pre-training. It introduces a lightweight framework with a depth network, a camera network that learns intrinsics, and a teacher that provides pseudo-disparities, optimized through a simple, joint loss that combines SSL and PSL terms. The method demonstrates strong gains on KITTI while maintaining efficiency, and shows improved zero-shot generalization to NYUv2 and Make3D, enabled by training without camera intrinsics and extensive video pre-training. Overall, NimbleD enables small, fast MDE models to reach state-of-the-art SSL performance, with practical benefits for latency-constrained AR/VR and metaverse deployments.

Abstract

We introduce NimbleD, an efficient self-supervised monocular depth estimation learning framework that incorporates supervision from pseudo-labels generated by a large vision model. This framework does not require camera intrinsics, enabling large-scale pre-training on publicly available videos. Our straightforward yet effective learning strategy significantly enhances the performance of fast and lightweight models without introducing any overhead, allowing them to achieve performance comparable to state-of-the-art self-supervised monocular depth estimation models. This advancement is particularly beneficial for virtual and augmented reality applications requiring low latency inference. The source code, model weights, and acknowledgments are available at https://github.com/xapaxca/nimbled .

Paper Structure (14 sections, 7 equations, 5 figures, 7 tables)

Introduction
Related Work
Method
Self-supervised Monocular Depth Estimation Enhanced by Pseudo-labels
Combined Self-supervised and Pseudo-supervised Loss Function
Large-Scale Video Pre-training
Experiments
Datasets
Implementation Details
Results
Evaluation on KITTI.
Generalization.
Ablation Study
Conclusion

Figures (5)

Figure 1: Overview of the proposed NimbleD learning framework. The framework comprises a depth network that outputs multi-scale disparity predictions ($d_t^{\text{pred}}$), a camera network that outputs the relative camera pose ($T$) and camera intrinsic parameters ($K$), and a teacher depth network that generates pseudo-disparities ($d_t^{\text{pseudo}}$). The ultimate objective of the method is to minimize the loss between the predicted and pseudo-disparities ($\textit{PSL Loss}$) and to minimize the image reconstruction loss ($\textit{SSL Loss}$).
Figure 2: The dataset for large-scale video pre-training is curated from publicly available videos. It consists of three equally represented classes - city walking, driving, and hiking - offering a diverse range of outdoor environments.
Figure 3: Qualitative results on the KITTI Eigen split, compared with the teacher model (Depth Anything) and the baseline models (Monodepth2, SwiftDepth, Lite-Mono). NimbleD visibly improves the depth estimation quality of all baseline models. It identifies distant objects that the teacher model does not detect and handles sky regions better than the baseline models.
Figure 4: Zero-shot qualitative results on NYUv2, compared with the baseline models (SwiftDepth, Lite-Mono). NimbleD noticeably improves the generalization ability of both baseline models.
Figure 5: Zero-shot qualitative results on Make3D, compared with the baseline models (SwiftDepth, Lite-Mono). NimbleD noticeably improves the generalization ability of both baseline models.
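
The objective described in the Figure 1 caption, an image reconstruction term (SSL) plus a term between predicted and pseudo-disparities (PSL), can be illustrated with a minimal sketch. This is not the authors' implementation: the plain L1 form of both terms, the function names, and the `psl_weight` parameter are assumptions made for illustration, and the multi-scale predictions mentioned in the caption are omitted. The paper's actual loss is given in its section on the combined loss function.

```python
import numpy as np


def ssl_loss(target: np.ndarray, reconstructed: np.ndarray) -> float:
    """Photometric term: mean absolute error between the target frame
    and the frame reconstructed from a neighbouring view.
    (Assumption: plain L1; the paper may use a different photometric error.)"""
    return float(np.mean(np.abs(target - reconstructed)))


def psl_loss(d_pred: np.ndarray, d_pseudo: np.ndarray) -> float:
    """Pseudo-label term: mean absolute error between the predicted
    disparity and the teacher's pseudo-disparity.
    (Assumption: plain L1 without any scale alignment.)"""
    return float(np.mean(np.abs(d_pred - d_pseudo)))


def joint_loss(target, reconstructed, d_pred, d_pseudo, psl_weight=1.0) -> float:
    """Sum of the SSL and PSL terms; psl_weight is a hypothetical
    balancing factor, not a value taken from the paper."""
    return ssl_loss(target, reconstructed) + psl_weight * psl_loss(d_pred, d_pseudo)


if __name__ == "__main__":
    target = np.ones((2, 2))
    reconstructed = np.zeros((2, 2))
    d_pred = np.full((2, 2), 0.5)
    d_pseudo = np.full((2, 2), 0.25)
    print(joint_loss(target, reconstructed, d_pred, d_pseudo))
```

In this sketch the teacher only supplies `d_pseudo` during training, so the extra supervision adds no cost at inference time, which matches the abstract's claim of no added overhead for the lightweight model.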
