NimbleD: Enhancing Self-supervised Monocular Depth Estimation with Pseudo-labels and Large-scale Video Pre-training
Albert Luginov, Muhammad Shahzad
TL;DR
NimbleD targets fast, low-latency monocular depth estimation (MDE) for real-time metaverse applications by combining self-supervised learning (SSL) with pseudo-label supervision (PSL) from a large vision model and large-scale video pre-training. It introduces a lightweight framework comprising a depth network, a camera network that learns intrinsics, and a teacher that provides pseudo-disparities, all optimized through a simple joint loss that combines SSL and PSL terms. The method reports strong gains on KITTI while maintaining efficiency, and improved zero-shot generalization to NYUv2 and Make3D, enabled by training without requiring known camera intrinsics and by extensive video pre-training. Overall, NimbleD lets small, fast MDE models achieve performance comparable to state-of-the-art SSL models, with practical benefits for latency-constrained AR/VR and metaverse deployments.
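To make the idea of a joint loss concrete, the following is a minimal sketch of how an SSL term and a PSL term might be combined. It is illustrative only and does not reproduce the authors' formulation: the function names, the plain L1 photometric error, the median and mean-absolute-deviation normalization of disparities, and the `psl_weight` parameter are all assumptions introduced here.

```python
import numpy as np


def normalize_disparity(disp, eps=1e-6):
    """Remove shift (median) and scale (mean absolute deviation) from a disparity map.

    Assumption: the teacher's pseudo-disparities are only defined up to an
    affine transform, so both maps are normalized before comparison.
    """
    shift = np.median(disp)
    scale = np.mean(np.abs(disp - shift))
    return (disp - shift) / (scale + eps)


def ssl_loss(target, reconstructed):
    """Hypothetical SSL term: mean L1 photometric error between the target
    frame and the view synthesized from a neighbouring frame."""
    return float(np.mean(np.abs(target - reconstructed)))


def psl_loss(pred_disp, teacher_disp):
    """Hypothetical PSL term: mean L1 distance between normalized disparities."""
    diff = normalize_disparity(pred_disp) - normalize_disparity(teacher_disp)
    return float(np.mean(np.abs(diff)))


def joint_loss(target, reconstructed, pred_disp, teacher_disp, psl_weight=1.0):
    """Weighted sum of the two terms; psl_weight is an assumed hyperparameter."""
    return ssl_loss(target, reconstructed) + psl_weight * psl_loss(pred_disp, teacher_disp)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((4, 5))
    teacher = rng.random((4, 5))
    # A prediction that differs from the teacher only by scale and shift
    # incurs (almost) no PSL penalty under this normalization.
    print(psl_loss(2.0 * teacher + 3.0, teacher))
    print(joint_loss(image, image + 0.5, teacher, teacher, psl_weight=0.5))
```

In this sketch the teacher is used only to compute a training loss, so it would add no cost at inference time, which is consistent with the efficiency claim above.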
Abstract
We introduce NimbleD, an efficient self-supervised monocular depth estimation learning framework that incorporates supervision from pseudo-labels generated by a large vision model. This framework does not require camera intrinsics, enabling large-scale pre-training on publicly available videos. Our straightforward yet effective learning strategy significantly enhances the performance of fast and lightweight models without introducing any overhead, allowing them to achieve performance comparable to state-of-the-art self-supervised monocular depth estimation models. This advancement is particularly beneficial for virtual and augmented reality applications requiring low-latency inference. The source code, model weights, and acknowledgments are available at https://github.com/xapaxca/nimbled.
