VIMD: Monocular Visual-Inertial Motion and Depth Estimation

Saimouli Katragadda, Guoquan Huang

TL;DR

A monocular visual-inertial motion and depth (VIMD) learning framework that estimates dense metric depth by combining accurate and efficient MSCKF-based monocular visual-inertial motion tracking with multi-view information to iteratively refine per-pixel scale.

Abstract

Accurate and efficient dense metric depth estimation is crucial for 3D visual perception in robotics and XR. In this paper, we develop a monocular visual-inertial motion and depth (VIMD) learning framework to estimate dense metric depth by leveraging accurate and efficient MSCKF-based monocular visual-inertial motion tracking. The core idea of the proposed VIMD is to exploit multi-view information to iteratively refine per-pixel scale, instead of globally fitting an invariant affine model as in prior work. The VIMD framework is highly modular, making it compatible with a variety of existing depth estimation backbones. We conduct extensive evaluations on the TartanAir and VOID datasets and demonstrate its zero-shot generalization capabilities on the AR Table dataset. Our results show that VIMD achieves exceptional accuracy and robustness even with extremely sparse input, as few as 10-20 metric depth points per image. This makes the proposed VIMD a practical solution for deployment in resource-constrained settings, while its robust performance and strong generalization capabilities offer significant potential across a wide range of scenarios.

Paper Structure

This paper contains 24 sections, 17 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: A motivating result: Invariant affinity is not sufficient to scale monocular depth prediction. Per-frame least-squares fitting of a global scale and offset (every 5th frame) using VOID data [Wong2020_RAL]. The offset varies more than the scale, indicating that a single global affine model cannot reliably align predictions over time.
  • Figure 2: Overall performance of the proposed VIMD: High-quality dense metric depths in both outdoor and indoor scenes.
  • Figure 3: The proposed visual-inertial motion and depth (VIMD) learning pipeline. (a) System overview: The VIO filter efficiently fuses RGB images and IMU data to estimate sparse features' metric depth and camera poses, which are then passed to the iterative depth refinement module to predict dense metric depth and its uncertainty. (b) Iterative refined metric depth module: The initial metric depth is estimated using the global alignment (GA) depth, which is the metric-aligned depth from the monocular depth estimator, fitted with a global scale and offset via least squares using the sparse depth. Reference frames are warped to the target frame using the predicted depth and VIO poses, and the scale is iteratively refined using a ConvGRU to predict the final depth and uncertainty. Multi-view information is leveraged to improve the accuracy and robustness of depth estimation.
  • Figure 4: Qualitative results of evaluation on VOID.
  • Figure 5: Qualitative results of evaluation on TartanAir.
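
The global alignment (GA) step named in the Figure 1 and Figure 3 captions, a least-squares fit of one scale and one offset against sparse metric depth, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fit_global_scale_offset` and `apply_global_alignment`, the toy data, and the choice to fit in depth space (rather than inverse depth) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def fit_global_scale_offset(
    pred: Sequence[float], sparse_metric: Sequence[float]
) -> Tuple[float, float]:
    """Closed-form least squares for a scale s and offset t that minimize
    sum_i (s * pred[i] + t - sparse_metric[i]) ** 2.

    `pred` holds the monocular predictions sampled at the sparse feature
    locations; `sparse_metric` holds the matching metric depths
    (hypothetical inputs, e.g. from a VIO front end).
    """
    n = len(pred)
    if n < 2 or n != len(sparse_metric):
        raise ValueError("need at least two matched samples")
    mean_p = sum(pred) / n
    mean_m = sum(sparse_metric) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        raise ValueError("predictions are constant; scale is unobservable")
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(pred, sparse_metric))
    scale = cov / var_p
    offset = mean_m - scale * mean_p
    return scale, offset


def apply_global_alignment(
    dense_pred: Sequence[Sequence[float]], scale: float, offset: float
) -> List[List[float]]:
    """Apply the single global affine model to every pixel of a dense map."""
    return [[scale * d + offset for d in row] for row in dense_pred]


# Toy data (assumed): four sparse points whose true relation is 2.0 * d + 0.3.
relative = [0.2, 0.5, 0.9, 1.4]
metric = [2.0 * r + 0.3 for r in relative]

scale, offset = fit_global_scale_offset(relative, metric)
ga_depth = apply_global_alignment([[0.2, 0.5], [0.9, 1.4]], scale, offset)
```

Because this model has only two parameters per frame, it cannot absorb spatially varying scale error, which is the limitation the per-pixel iterative refinement described in Figure 3 is meant to address.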