MotionHint: Self-Supervised Monocular Visual Odometry with Motion Constraints

Cong Wang, Yu-Ping Wang, Dinesh Manocha

TL;DR

Experimental results show that the MotionHint algorithm can be easily applied to existing open-sourced state-of-the-art SSM-VO systems to greatly improve the performance by reducing the resulting ATE by up to 28.73%.

Abstract

We present a novel self-supervised algorithm named MotionHint for monocular visual odometry (VO) that takes motion constraints into account. A key aspect of our approach is to use an appropriate motion model that can help existing self-supervised monocular VO (SSM-VO) algorithms to overcome issues related to the local minima within their self-supervised loss functions. The motion model is expressed with a neural network named PPnet. It is trained to coarsely predict the next pose of the camera and the uncertainty of this prediction. Our self-supervised approach combines the original loss and the motion loss, which is the weighted difference between the prediction and the generated ego-motion. Taking two existing SSM-VO systems as our baseline, we evaluate our MotionHint algorithm on the standard KITTI benchmark. Experimental results show that our MotionHint algorithm can be easily applied to existing open-sourced state-of-the-art SSM-VO systems to greatly improve the performance by reducing the resulting ATE by up to 28.73%.
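
The combined objective described in the abstract can be illustrated with a minimal sketch. The abstract only says that the motion loss is a weighted difference between the PPnet prediction and the ego-motion generated by the SSM-VO. The inverse-variance weighting, the 6-vector motion parameterization, and the names `motion_loss`, `total_loss`, `sigma`, and `lam` are assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def motion_loss(pred_ego_motion, pseudo_label, sigma, eps=1e-6):
    """Uncertainty-weighted squared difference between two motions.

    Assumption: each motion is a length-6 vector (3 translation and
    3 rotation parameters) and `sigma` is a per-component uncertainty.
    Components with larger uncertainty contribute less to the loss.
    """
    pred = np.asarray(pred_ego_motion, dtype=float)
    label = np.asarray(pseudo_label, dtype=float)
    # Hypothetical weighting: inverse variance, stabilized by eps.
    weights = 1.0 / (np.asarray(sigma, dtype=float) ** 2 + eps)
    return float(np.mean(weights * (pred - label) ** 2))

def total_loss(original_loss, pred_ego_motion, pseudo_label, sigma, lam=0.1):
    """Original self-supervised loss plus the weighted motion loss.

    `lam` is a hypothetical balancing factor between the two terms.
    """
    return original_loss + lam * motion_loss(pred_ego_motion, pseudo_label, sigma)
```

Under this weighting, a pseudo label that PPnet is unsure about pulls the SSM-VO only weakly, while a confident one acts as a stronger constraint.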

Paper Structure

This paper contains 24 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Trajectory comparison. (a) shows the trajectories of SC-Depth [bian2021] (our baseline SSM-VO) and our improved version. (b), (c) and (d) provide the error maps of the different predicted trajectories, generated with the evo toolbox [evo2017]. Our MotionHint algorithm greatly improves the performance of SC-Depth, reducing the resulting ATE by about 25%, especially in the regions marked by the red circles.
  • Figure 2: Overview. Our MotionHint algorithm consists of three training phases. (a) Pre-training the SSM-VO: We pre-train an SSM-VO and use it as the model to be fine-tuned. (b) Extracting the motion model: PPnet is pre-trained to extract the motion model, which can predict the next pose and its uncertainty from a set of consecutive prior poses. (c) Fine-tuning the SSM-VO using the motion model: PPnet takes a set of consecutive prior poses saved in the Pose Manager as input and predicts the pseudo pose. The pseudo pose is then used to build the pseudo label for the current predicted ego-motion. The weighted difference between the pseudo label and the predicted ego-motion yields the motion loss, which guides the SSM-VO out of local minima. '$\otimes$' computes the relative pose of two absolute poses; '$\ominus$' computes the weighted difference; '$\oplus$' computes the supervision loss of the original self-supervised system.
  • Figure 3: Pose centralization. $\bm{O}$ refers to the start point, where the uncertainty is zero. The intensity of the red color indicates the level of uncertainty. We limit the uncertainties of pose sequences to a fixed range by reselecting the starting point.
  • Figure 4: Qualitative results of PPnet. The black line refers to the ground truth and the colored points refer to poses predicted by PPnet. Points in a redder color indicate larger uncertainty.
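
The relative-pose operation ('$\otimes$' in Figure 2) and the reselection of the starting point in Figure 3 can be sketched as follows. This is only an illustration: it assumes poses are 4x4 homogeneous camera-to-world matrices, and the helper names `make_pose`, `relative_pose`, and `centralize` are hypothetical. The captions do not specify the paper's pose parameterization or how the new starting point is chosen.

```python
import numpy as np

def make_pose(yaw, translation):
    """Build a 4x4 pose from a rotation about the z-axis and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def relative_pose(T_a, T_b):
    """Pose of frame b expressed in frame a (assumes camera-to-world poses)."""
    return np.linalg.inv(T_a) @ T_b

def centralize(poses, origin_index):
    """Re-express a pose sequence relative to poses[origin_index].

    Assumption: this mirrors the idea of reselecting the starting point,
    so that the chosen pose becomes the identity (the new origin).
    """
    T0_inv = np.linalg.inv(poses[origin_index])
    return [T0_inv @ T for T in poses]

# Usage: the relative pose maps T_a onto T_b when composed on the right.
T_a = make_pose(0.3, [1.0, 2.0, 0.0])
T_b = make_pose(0.5, [2.0, 3.0, 0.0])
T_rel = relative_pose(T_a, T_b)
recentred = centralize([T_a, T_b], origin_index=1)
```

Because centralization applies the same rigid transform to every pose, relative motions between consecutive poses are unchanged; only the reference frame moves.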