NimbleD: Enhancing Self-supervised Monocular Depth Estimation with Pseudo-labels and Large-scale Video Pre-training

Albert Luginov, Muhammad Shahzad

TL;DR

NimbleD targets fast, low-latency monocular depth estimation (MDE) for real-time metaverse applications by combining self-supervised learning (SSL) with pseudo-label supervision (PSL) from a large vision model and large-scale video pre-training. It introduces a lightweight framework with a depth network, a camera network that learns intrinsics, and a teacher that provides pseudo-disparities, trained with a simple joint loss that combines SSL and PSL terms. The method delivers strong gains on KITTI while remaining efficient, and it improves zero-shot generalization to NYUv2 and Make3D. Because training does not require camera intrinsics, the framework can be pre-trained on large collections of publicly available videos. Overall, NimbleD enables small, fast MDE models to reach state-of-the-art SSL performance, with practical benefits for latency-constrained AR/VR and metaverse deployments.

Abstract

We introduce NimbleD, an efficient self-supervised monocular depth estimation learning framework that incorporates supervision from pseudo-labels generated by a large vision model. This framework does not require camera intrinsics, enabling large-scale pre-training on publicly available videos. Our straightforward yet effective learning strategy significantly enhances the performance of fast and lightweight models without introducing any overhead, allowing them to achieve performance comparable to state-of-the-art self-supervised monocular depth estimation models. This advancement is particularly beneficial for virtual and augmented reality applications requiring low-latency inference. The source code, model weights, and acknowledgments are available at https://github.com/xapaxca/nimbled.

Paper Structure (14 sections, 7 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Overview of the proposed NimbleD learning framework. The framework comprises a depth network that outputs multi-scale disparity predictions ($d_t^{\text{pred}}$), a camera network that outputs the relative camera pose ($T$) and camera intrinsic parameters ($K$), and a teacher depth network that generates pseudo-disparities ($d_t^{\text{pseudo}}$). The ultimate objective of the method is to minimize the loss between the predicted and pseudo-disparities ($\textit{PSL Loss}$) and to minimize the image reconstruction loss ($\textit{SSL Loss}$).
  • Figure 2: The dataset for large-scale video pre-training is curated from publicly available videos. It consists of three equally represented classes (city walking, driving, and hiking), offering a diverse range of outdoor environments.
  • Figure 3: Qualitative results on the KITTI Eigen split, compared with the teacher model (Depth Anything) and the baseline models (Monodepth2, SwiftDepth, and Lite-Mono). NimbleD observably enhances the depth estimation quality of all baseline models. It identifies distant objects not detected by the teacher model and demonstrates improved handling of sky regions compared to the baseline models.
  • Figure 4: Zero-shot qualitative results on NYUv2, compared to the baseline models (SwiftDepth and Lite-Mono). NimbleD noticeably improves the generalization ability of both baseline models.
  • Figure 5: Zero-shot qualitative results on Make3D, compared to the baseline models (SwiftDepth and Lite-Mono). NimbleD noticeably improves the generalization ability of both baseline models.
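
The objective described in the Figure 1 caption (an SSL image reconstruction term plus a PSL term between predicted and pseudo-disparities) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. The function names, the `psl_weight` parameter, the plain L1 photometric error, and the median/mean-absolute-deviation normalization of disparities are all assumptions made for the example; the summary above does not specify the exact form of either loss term. Images and disparity maps are represented as flat lists of floats, and the multi-scale predictions are ignored for brevity.

```python
from statistics import median


def normalize(disparity):
    """Remove shift (median) and scale (mean absolute deviation) from a flat
    disparity list. This normalization is an assumption of the sketch."""
    shift = median(disparity)
    scale = sum(abs(d - shift) for d in disparity) / len(disparity)
    scale = max(scale, 1e-8)  # guard against constant disparity maps
    return [(d - shift) / scale for d in disparity]


def psl_loss(d_pred, d_pseudo):
    """Mean absolute difference between the normalized predicted disparities
    and the normalized pseudo-disparities from the teacher."""
    p, q = normalize(d_pred), normalize(d_pseudo)
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


def ssl_loss(target, reconstructed):
    """Plain L1 photometric error between the target frame and the frame
    reconstructed from a neighbouring view (a simplification)."""
    return sum(abs(a - b) for a, b in zip(target, reconstructed)) / len(target)


def joint_loss(d_pred, d_pseudo, target, reconstructed, psl_weight=1.0):
    """Weighted sum of the two terms; psl_weight is a hypothetical knob."""
    return ssl_loss(target, reconstructed) + psl_weight * psl_loss(d_pred, d_pseudo)


if __name__ == "__main__":
    d_pred = [1.0, 2.0, 3.0, 4.0]
    d_pseudo = [2.0 * d + 5.0 for d in d_pred]  # same structure, other scale/shift
    print(psl_loss(d_pred, d_pseudo))
    print(joint_loss(d_pred, d_pseudo, [0.5, 0.5], [0.4, 0.7]))
```

Normalizing both disparity maps before comparing them means the PSL term in this sketch penalizes only differences in relative depth structure, so a teacher whose output differs from the student's by a global scale and shift contributes no loss.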