Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry
Boris Chidlovskii, Leonid Antsfeld
TL;DR
This work tackles monocular depth and visual odometry estimation under minimal supervision by combining CroCo cross-view pretraining with self-supervised finetuning on unlabeled videos. A shared ViT-based CroCo-DVO architecture learns 3D geometry during pretraining, then jointly predicts dense depth maps and 6-DoF camera poses from frame pairs, guided by geometric and photometric losses, including a self-discovered mask for dynamic regions. The approach is enhanced with Dense Prediction Transformer (DPT) layers and transfer adapters (AdaptFormer), which improve depth accuracy on multiple benchmarks. Across six diverse datasets, the method achieves state-of-the-art depth results and competitive VO performance, demonstrating robustness across indoor/outdoor, static/dynamic, and real/synthetic settings.
Abstract
For the task of simultaneous monocular depth and visual odometry estimation, we propose learning self-supervised transformer-based models in two steps. The first step is generic pretraining to learn 3D geometry with the cross-view completion objective (CroCo); the second is self-supervised finetuning on non-annotated videos. We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles' using standard components such as visual transformers, dense prediction transformers and adapters. We demonstrate the effectiveness of the proposed method by running evaluations on six benchmark datasets, both static and dynamic, indoor and outdoor, with synthetic and real images. On all datasets, our method outperforms state-of-the-art methods, in particular on the depth prediction task.
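Self-supervised finetuning on unlabeled video generally relies on view synthesis: predicted depth and relative pose are used to reproject pixels of one frame into a neighboring frame, and the resulting photometric error serves as the training signal. The NumPy sketch below illustrates that general idea only and is not the authors' implementation. The function names `reproject` and `masked_photometric_l1`, the pinhole intrinsics `K`, the grayscale image assumption, and the plain L1 error are all assumptions made for illustration.

```python
import numpy as np

def reproject(depth, K, T):
    """Map each target pixel to pixel coordinates in the source view.

    depth: (H, W) depth of the target frame (hypothetical network output)
    K:     (3, 3) pinhole camera intrinsics (assumed known)
    T:     (4, 4) relative pose, target camera -> source camera
    Returns an array of shape (2, H, W) holding (u, v) source coordinates.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates, shape (3, H*W)
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    # Back-project to 3D points in the target camera frame
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Move the points into the source camera frame
    cam_src = T[:3, :3] @ cam + T[:3, 3:4]
    # Project onto the source image plane (guard against division by zero)
    proj = K @ cam_src
    uv = proj[:2] / np.clip(proj[2:3], 1e-6, None)
    return uv.reshape(2, H, W)

def masked_photometric_l1(target, warped, mask):
    """Mean absolute intensity error over pixels weighted by mask.

    target, warped: (H, W) grayscale images
    mask:           (H, W) weights in [0, 1]; low values down-weight
                    pixels such as moving objects (assumed given here)
    """
    err = np.abs(target - warped)
    return float((mask * err).sum() / max(mask.sum(), 1e-6))
```

With an identity pose, `reproject` returns the original pixel grid, which is a convenient sanity check. In a full pipeline the returned coordinates would be used to bilinearly sample the source image before computing the loss; that sampling step is omitted here for brevity.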
