MDE-VIO: Enhancing Visual-Inertial Odometry Using Learned Depth Priors

Arda Alniak, Sinan Kalkan, Mustafa Mert Ankarali, Afsar Saranli, Abdullah Aydin Alatan

TL;DR

The paper addresses the challenge of obtaining metric-scale monocular VIO in low-texture environments under edge-device constraints. It introduces MDE-VIO, a framework that fuses lightweight monocular depth priors into the VINS-Mono pipeline through a front-end Depth-Injected Feature Tracking (DIFT) module and a back-end that combines affine-invariant depth residuals with Pairwise Ordinal Constraints (OrC), together with uncertainty-based gating and MDE-based depth initialization. Key contributions include a practical edge-friendly fusion strategy, uncertainty-aware depth integration, accelerated depth initialization, and robust back-end constraints that collectively prevent divergence and reduce Absolute Trajectory Error (ATE) by up to 28.3% on challenging datasets. The results indicate that temporal consistency and back-end pose-geometry fusion are essential for real-time VIO on resource-constrained devices, and they point toward robust, depth-informed odometry in harsh operational settings.

Abstract

Traditional monocular Visual-Inertial Odometry (VIO) systems struggle in low-texture environments where sparse visual features are insufficient for accurate pose estimation. To address this, dense Monocular Depth Estimation (MDE) has been widely explored as a complementary information source. While recent complex foundation models based on Vision Transformers (ViT) offer dense, geometrically consistent depth, their computational demands typically preclude real-time edge deployment. Our work bridges this gap by integrating learned depth priors directly into the VINS-Mono optimization backend. We propose a novel framework that enforces affine-invariant depth consistency and pairwise ordinal constraints, explicitly filtering unstable artifacts via variance-based gating. This approach strictly adheres to the computational limits of edge devices while robustly recovering metric scale. Extensive experiments on the TartanGround and M3ED datasets demonstrate that our method prevents divergence in challenging scenarios and delivers significant accuracy gains, reducing Absolute Trajectory Error (ATE) by up to 28.3%. Code will be made available.
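
To make the back-end terms named in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that alignment is done directly on depth values (the paper may instead use inverse depth), that the gate is a hard threshold on per-feature variance, and that the ordinal term is a hinge penalty. The function names, the `var_max` threshold, and the `margin` parameter are all hypothetical.

```python
import numpy as np


def variance_gate(var, var_max=0.05):
    """Hypothetical hard gate: weight 1.0 where variance < var_max, else 0.0."""
    return (np.asarray(var, dtype=float) < var_max).astype(float)


def fit_affine(d_mde, d_vio, weights):
    """Weighted least squares for (s, t) minimising
    sum_i w_i * (s * d_mde_i + t - d_vio_i)^2."""
    d_mde = np.asarray(d_mde, dtype=float)
    d_vio = np.asarray(d_vio, dtype=float)
    sw = np.sqrt(np.asarray(weights, dtype=float))
    # Design matrix [d_mde, 1], with each row scaled by sqrt(weight).
    A = np.stack([d_mde, np.ones_like(d_mde)], axis=1) * sw[:, None]
    b = d_vio * sw
    (s, t), *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(s), float(t)


def depth_residuals(d_mde, d_vio, s, t, weights):
    """Gated unary residuals between the aligned prior and the estimated depths."""
    d_mde = np.asarray(d_mde, dtype=float)
    d_vio = np.asarray(d_vio, dtype=float)
    return np.asarray(weights, dtype=float) * (s * d_mde + t - d_vio)


def ordinal_residual(d_i, d_j, order, margin=0.0):
    """Hinge penalty on a depth pair; order=+1 means the prior ranks
    feature i as farther than feature j, order=-1 the opposite."""
    return max(0.0, margin - order * (d_i - d_j))


# Toy data: four consistent features plus one unstable prediction.
d_mde = np.array([0.1, 0.2, 0.3, 0.4, 0.9])       # relative MDE depths
d_vio = np.array([0.7, 0.9, 1.1, 1.3, 9.0])       # estimated metric depths
var = np.array([0.01, 0.01, 0.01, 0.01, 0.50])    # temporal variance of the prior

w = variance_gate(var)
s, t = fit_affine(d_mde, d_vio, w)
res = depth_residuals(d_mde, d_vio, s, t, w)
```

In this toy case the gate removes the last feature, so the fit is driven only by the four consistent ones. An ordinal term of this kind depends only on the ranking of depths, which is why it stays meaningful even when the absolute scale of the prior is unreliable.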

Paper Structure (14 sections, 7 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: High-level overview of the proposed MDE-VIO framework. The system augments the standard VINS-Mono by processing monocular images through a Monocular Depth Estimator. These learned depth priors are integrated into both the Front-End and Back-End of the estimator to achieve robust, metric-scaled trajectory estimation on robotic platforms.
  • Figure 2: MDE-VIO: The proposed VIO enhancement framework. (Left) The DIFT module replaces the Blue channel of the RGB input with normalized MDE predictions to enhance KLT tracking in low-texture regions. (Right) The proposed approach aligns learned depth priors $d_{MDE}$ for the set of tracked features $\mathcal{F}_k$ using affine parameters $(s, t)$, filters unstable estimates via a variance-based gate $w_{gate}(\sigma_i^2)$, and integrates both unary depth factors $r_{\mathcal{D}}$ and pairwise OrC $r(d_i, d_j)$ into the VIO Optimizer.
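
The DIFT step described in the Figure 2 caption (replacing the Blue channel with normalized MDE predictions) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: it assumes an H×W×3 `uint8` image in RGB channel order (blue at index 2; with OpenCV's default BGR order it would be index 0) and per-frame min-max normalization, which the caption does not specify. The function name `inject_depth` and the `eps` guard are hypothetical.

```python
import numpy as np


def inject_depth(rgb, depth, eps=1e-6):
    """Return a copy of an HxWx3 uint8 RGB image whose blue channel
    is replaced by the min-max normalised depth map, scaled to 0..255."""
    rgb = np.asarray(rgb)
    depth = np.asarray(depth, dtype=np.float64)
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("rgb must have shape (H, W, 3)")
    if rgb.shape[:2] != depth.shape:
        raise ValueError("depth must have shape (H, W)")

    lo, hi = depth.min(), depth.max()
    # eps avoids division by zero for a constant depth map.
    norm = (depth - lo) / max(hi - lo, eps)

    out = rgb.copy()                      # leave the input image untouched
    out[..., 2] = np.round(norm * 255.0).astype(np.uint8)
    return out


# Toy example: a flat grey image (no texture) and a depth map with structure.
rgb = np.full((2, 2, 3), 100, dtype=np.uint8)
depth = np.array([[1.0, 2.0], [3.0, 5.0]])
fused = inject_depth(rgb, depth)
```

The fused image would then be passed to the usual grayscale conversion and KLT tracker. The intent, as the caption puts it, is that depth structure supplies intensity gradients in regions where the original image has little texture.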