Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry

Boris Chidlovskii, Leonid Antsfeld

TL;DR

This work tackles monocular depth and visual odometry (VO) estimation under minimal supervision by combining CroCo cross-view pretraining with self-supervised finetuning on unlabeled videos. A shared ViT-based CroCo-DVO architecture learns 3D geometry during pretraining, then jointly predicts dense depth maps and 6-DoF camera poses from frame pairs, guided by geometric and photometric losses, including a self-discovered mask to handle dynamic objects. The approach is enhanced with Dense Prediction Transformer (DPT) layers and transfer adapters (AdaptFormer), which yield improved depth accuracy on multiple benchmarks. Across six diverse datasets, the method achieves state-of-the-art depth results and competitive VO performance, demonstrating robustness to indoor/outdoor, static/dynamic, and real/synthetic settings.

Abstract

For the task of simultaneous monocular depth and visual odometry estimation, we propose learning self-supervised transformer-based models in two steps. The first step is a generic pretraining that learns 3D geometry using the cross-view completion objective (CroCo); the second is self-supervised finetuning on non-annotated videos. We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles' using standard components such as vision transformers, dense prediction transformers and adapters. We demonstrate the effectiveness of the proposed method on six benchmark datasets, covering static and dynamic scenes, indoor and outdoor environments, and synthetic and real images. On all datasets, our method outperforms state-of-the-art methods, in particular on the depth prediction task.
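
The self-supervised finetuning step relies on a geometric consistency term and a self-discovered mask, but this summary does not give their formulas. The sketch below uses one common formulation from the scale-consistent depth literature (normalized depth difference, with the mask defined as one minus that difference); treating it as the paper's exact loss is an assumption. All function names are hypothetical, depth maps are flattened into plain lists, and the warping of the first depth map into the second view with the predicted pose is assumed to have happened already.

```python
def depth_inconsistency(d_warped, d_interp, eps=1e-7):
    """Per-pixel normalized difference between two aligned depth maps.

    d_warped: depth of frame I projected into frame I' using the pose T.
    d_interp: depth of frame I' sampled at the same pixel locations.
    Values lie in [0, 1); 0 means the two views agree perfectly.
    """
    return [abs(a - b) / (a + b + eps) for a, b in zip(d_warped, d_interp)]


def self_discovered_mask(d_diff):
    """Weight in (0, 1]: low where depths disagree (moving objects, occlusions)."""
    return [1.0 - v for v in d_diff]


def geometric_consistency_loss(d_diff):
    """Mean depth inconsistency over all valid pixels."""
    return sum(d_diff) / len(d_diff)


def masked_photometric_loss(photo_err, mask):
    """Photometric error re-weighted by the mask, averaged over pixels."""
    return sum(m * e for m, e in zip(mask, photo_err)) / len(photo_err)


# Toy example: the third pixel belongs to a moving object, so its depths disagree.
d_diff = depth_inconsistency([2.0, 4.0, 1.0], [2.0, 4.0, 3.0])
mask = self_discovered_mask(d_diff)
loss_geo = geometric_consistency_loss(d_diff)
loss_photo = masked_photometric_loss([0.2, 0.2, 0.8], mask)
```

In this toy case the inconsistent pixel receives a mask weight of about 0.5, so its large photometric error contributes only half as much to the total. No labels are needed at any point, which is what makes the finetuning self-supervised.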

Paper Structure (12 sections, 4 equations, 5 figures, 9 tables)

This paper contains 12 sections, 4 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Simultaneous depth and visual odometry estimation from a video, with two steps of model pretraining and finetuning.
  • Figure 2: CroCo-DVO architecture. (1) The model is pretrained on the cross-view completion task using a large amount of heterogeneous data. (2) At the finetuning step, given two consecutive frames ($I, I'$), the network estimates their depth maps ($D, D'$) and the relative pose $T$. The self-supervised loss $L_{Self}$ is then computed; it is composed of the pixel-wise depth inconsistency between $D$ and $D'$, a geometric consistency loss and a self-discovered mask to handle dynamic objects (see the section on losses).
  • Figure 3: Adapters: training adapters while freezing the main backbone. AdaptFormer (Chen et al., 2022) replaces the MLP block in the transformer encoder with AdaptMLP, which consists of two sub-branches. The MLP layer in the left branch, identical to the original network, is frozen; the right branch introduces a lightweight module for task-specific finetuning.
  • Figure 4: Qualitative depth estimation for the state-of-the-art and our methods.
  • Figure 5: Visual odometry estimation: estimated Gibson trajectories (orange) vs. ground-truth trajectories (blue).
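
The two-branch AdaptMLP block described in the Figure 3 caption can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the class name, the sizes, the scale value, the ReLU activation in both branches, and the omission of layer normalization and biases are all simplifying assumptions. The up-projection is zero-initialized here so that the block starts out identical to the frozen network.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    return [max(0.0, a) for a in v]


class AdaptMLPSketch:
    """Frozen MLP branch plus a lightweight trainable bottleneck branch."""

    def __init__(self, dim, hidden, bottleneck, scale=0.1, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        # Left branch: stands in for the pretrained MLP, never updated.
        self.w1 = rand(hidden, dim)
        self.w2 = rand(dim, hidden)
        # Right branch: down-projection, ReLU, up-projection (trainable).
        self.down = rand(bottleneck, dim)
        self.up = [[0.0] * bottleneck for _ in range(dim)]
        self.scale = scale

    def frozen_mlp(self, x):
        return matvec(self.w2, relu(matvec(self.w1, x)))

    def adapter(self, x):
        return matvec(self.up, relu(matvec(self.down, x)))

    def forward(self, x):
        # Residual input + frozen branch + scaled adapter branch.
        main = self.frozen_mlp(x)
        side = self.adapter(x)
        return [xi + m + self.scale * s for xi, m, s in zip(x, main, side)]

    def num_trainable(self):
        # Only the adapter weights would receive gradients during finetuning.
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)

    def num_frozen(self):
        return sum(len(r) for r in self.w1) + sum(len(r) for r in self.w2)


block = AdaptMLPSketch(dim=4, hidden=8, bottleneck=2)
output = block.forward([1.0, -2.0, 0.5, 3.0])
```

With these toy sizes the adapter holds 16 weights against 64 in the frozen branch. The bottleneck width controls that ratio, which is why training only the adapter is cheap compared with finetuning the whole backbone.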