Relative Pose Estimation through Affine Corrections of Monocular Depth Priors

Yifan Yu, Shaohui Liu, Rémi Pautrat, Marc Pollefeys, Viktor Larsson

TL;DR

This work addresses how to leverage monocular depth priors for relative pose estimation by explicitly modeling the affine ambiguities in depth predictions. It introduces three solvers that jointly estimate the relative pose $(\mathbf{R}, \mathbf{t})$ and the depth-affine parameters $(\alpha, \beta_1, \beta_2)$, covering calibrated and uncalibrated camera setups, and optionally the focal lengths. A hybrid LO-MSAC framework then fuses these depth-aware solvers with classic point-based solvers and epipolar constraints, using depth-induced reprojection errors alongside Sampson errors for robust scoring and optimization. Experimental results on ScanNet-1500, MegaDepth-1500, ETH3D, and related datasets show consistent improvements over baselines across settings, and ablations confirm the benefit of modeling depth shifts as well as of the proposed hybrid estimation strategy. The approach is agnostic to the choice of depth model and image matcher, and the authors provide open-source code to facilitate integration into existing pipelines.
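To make the two error terms mentioned above concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: keypoints are assumed to be in normalized (calibrated) image coordinates, the corrected depths are taken as $D_1 + \beta_1$ and $\alpha (D_2 + \beta_2)$, and all function names (`reprojection_error`, `sampson_error`, `score_pose`, `synthetic_pair`) are hypothetical.

```python
import numpy as np

def to_homogeneous(x):
    """Append a unit third coordinate to N x 2 normalized image points."""
    return np.hstack([x, np.ones((x.shape[0], 1))])

def skew(t):
    """Cross-product matrix: skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def reprojection_error(x_src, x_dst, depth_src, R, t):
    """Lift source keypoints with their depth, move them by (R, t), reproject."""
    X_src = to_homogeneous(x_src) * depth_src[:, None]
    X_dst = X_src @ R.T + t
    proj = X_dst[:, :2] / X_dst[:, 2:3]
    return np.linalg.norm(proj - x_dst, axis=1)

def sampson_error(x1, x2, R, t):
    """Squared Sampson error for E = [t]_x R on normalized coordinates."""
    E = skew(t) @ R
    x1h, x2h = to_homogeneous(x1), to_homogeneous(x2)
    Ex1 = x1h @ E.T    # row i is E @ x1_i
    Etx2 = x2h @ E     # row i is E.T @ x2_i
    num = np.sum(x2h * Ex1, axis=1) ** 2
    den = Ex1[:, 0] ** 2 + Ex1[:, 1] ** 2 + Etx2[:, 0] ** 2 + Etx2[:, 1] ** 2
    return num / den

def score_pose(x1, x2, D1, D2, R, t, alpha, beta1, beta2):
    """Return depth-induced errors in both directions plus the Sampson error."""
    d1 = D1 + beta1            # assumed correction for depth map 1 (scale fixed to 1)
    d2 = alpha * (D2 + beta2)  # assumed correction for depth map 2
    e12 = reprojection_error(x1, x2, d1, R, t)
    e21 = reprojection_error(x2, x1, d2, R.T, -R.T @ t)
    return e12, e21, sampson_error(x1, x2, R, t)

def synthetic_pair(n=50, seed=0):
    """Toy scene whose depth priors are off by a known scale and shifts."""
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([rng.uniform(-1, 1, n),
                          rng.uniform(-1, 1, n),
                          rng.uniform(3, 6, n)])
    th = 0.2
    R = np.array([[np.cos(th), 0.0, np.sin(th)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(th), 0.0, np.cos(th)]])
    t = np.array([0.5, 0.1, 0.2])
    X2 = X1 @ R.T + t
    alpha, beta1, beta2 = 2.0, 0.3, -0.2
    x1, x2 = X1[:, :2] / X1[:, 2:3], X2[:, :2] / X2[:, 2:3]
    D1 = X1[:, 2] - beta1          # so that D1 + beta1 is the true depth
    D2 = X2[:, 2] / alpha - beta2  # so that alpha * (D2 + beta2) is the true depth
    return x1, x2, D1, D2, R, t, (alpha, beta1, beta2)
```

With the true affine parameters, all three errors vanish on this noise-free toy data. Dropping the shifts (`beta1 = beta2 = 0`) leaves the Sampson error unchanged, since it depends only on the pose, but makes the depth-induced reprojection errors nonzero. This is one way to see why the two error types carry complementary information.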

Abstract

Monocular depth estimation (MDE) models have undergone significant advancements over recent years. Many MDE models aim to predict affine-invariant relative depth from monocular images, while recent developments in large-scale training and vision foundation models enable reasonable estimation of metric (absolute) depth. However, effectively leveraging these predictions for geometric vision tasks, in particular relative pose estimation, remains relatively underexplored. While depths provide rich constraints for cross-view image alignment, the intrinsic noise and ambiguity of monocular depth priors present practical challenges to improving upon classic keypoint-based solutions. In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. We further propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints. We find that the affine correction modeling benefits not only the relative depth priors but also, surprisingly, the "metric" ones. Results across multiple datasets demonstrate large improvements of our approach over classic keypoint-based baselines and PnP-based solutions, under both calibrated and uncalibrated setups. We also show that our method improves consistently with different feature matchers and MDE models, and can further benefit from very recent advances in both modules. Code is available at https://github.com/MarkYu98/madpose.
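As a reading aid, the affine model can be written as a per-correspondence constraint. This is a sketch in the notation used on this page rather than a quotation of the paper's equations; the homogeneous normalized keypoints $\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2$ and their predicted depths $D_1, D_2$ are symbols assumed here for illustration:

$$\alpha\,(D_2 + \beta_2)\,\tilde{\mathbf{x}}_2 = \mathbf{R}\,(D_1 + \beta_1)\,\tilde{\mathbf{x}}_1 + \mathbf{t}.$$

Only one scale $\alpha$ appears because the global scale of a two-view reconstruction is unobservable, so the scale of the first depth map can be fixed to 1. In the calibrated case, each correspondence contributes three scalar equations against nine unknowns (three for $\mathbf{R}$, three for $\mathbf{t}$, plus $\alpha, \beta_1, \beta_2$), so a simple counting argument suggests that at least three correspondences are needed. Setting $\beta_1 = \beta_2 = 0$ reduces the constraint to the scale-only alignment that the affine model generalizes.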

Paper Structure (17 sections, 12 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Our method jointly estimates affine corrections of monocular depth maps $D_1 + \beta_1$ and $\alpha (D_2 + \beta_2)$ together with relative pose $\mathbf{R}, \mathbf{t}$ (Right), whereas the classic way of aligning the depth maps with only scale modeling ($\alpha$) leads to wrong and distorted alignments (Left).
  • Figure 2: Pipeline overview: Our method takes a pair of images as input, runs off-the-shelf feature matching and monocular depth estimation, then jointly estimates the relative pose, scale and shift parameters of the two depth maps, and optionally the focal lengths. Our method incorporates monocular depth priors in all stages (in green) of hybrid LO-MSAC [hybridransac, LOMSAC], including 3 new depth-aware solvers, while still being able to leverage traditional point-based solvers [Nister03FivePt, hartley_zisserman_2004, Stewenius2008] (in blue).
  • Figure 3: Pose error AUCs on sampled indoor ETH3D [ETH3D] image pairs with fewer covisible GT points than the thresholds on the x-axis. Left: calibrated, SP+LG, and MoGe [wang2024moge] priors; Right: shared-focal, SP+LG, and DAv2-met. [depth_anything_v2] metric priors.
  • Figure 4: Visualization on ETH3D [ETH3D]. Left: back-projected GT depth with pose found by point-based method (translation rescaled to match GT); Middle: back-projected depth priors from Marigold [marigold] aligned using the scale, shifts, pose, and focal length from our method; Right: GT depth with GT pose.
  • Figure 5: Rotation and translation error by adding shift values to GT depth as "depth priors" on ScanNet-1500 [dai2017scannet].
  • ...and 4 more figures
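
The shift experiment summarized in the Figure 5 caption can be mimicked on toy data. The sketch below is not the paper's protocol: it assumes synthetic noise-free points, adds the same shift to the true depths of both views, and uses a scale-only similarity fit (Umeyama alignment of the two back-projected point clouds) as a simple stand-in for "aligning the depth maps with only scale modeling". All function names are hypothetical.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Least-squares similarity (scale, R, t) such that dst ~ scale * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    sc, dc = src - mu_s, dst - mu_d
    cov = dc.T @ sc / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    sign = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    Dm = np.diag([1.0, 1.0, sign])
    R = U @ Dm @ Vt
    scale = np.sum(S * np.diag(Dm)) / np.mean(np.sum(sc ** 2, axis=1))
    t = mu_d - scale * R @ mu_s
    return scale, R, t

def rotation_error_deg(R_est, R_gt):
    """Angle of the residual rotation between two rotation matrices, in degrees."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def shift_experiment(shifts, n=200, seed=0):
    """Rotation error of a scale-only alignment when depths carry an unmodeled shift."""
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([rng.uniform(-1.5, 1.5, n),
                          rng.uniform(-1.5, 1.5, n),
                          rng.uniform(2.0, 6.0, n)])
    th = 0.3
    R = np.array([[np.cos(th), 0.0, np.sin(th)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(th), 0.0, np.cos(th)]])
    t = np.array([0.5, 0.0, 0.1])
    X2 = X1 @ R.T + t
    z1, z2 = X1[:, 2], X2[:, 2]
    rays1, rays2 = X1 / z1[:, None], X2 / z2[:, None]  # homogeneous normalized points
    errors = []
    for s in shifts:
        P1 = rays1 * (z1 + s)[:, None]  # back-projection with shifted depth, view 1
        P2 = rays2 * (z2 + s)[:, None]  # back-projection with shifted depth, view 2
        _, R_est, _ = umeyama_alignment(P1, P2)
        errors.append(rotation_error_deg(R_est, R))
    return np.array(errors)

if __name__ == "__main__":
    for s, e in zip([0.0, 0.25, 0.5], shift_experiment([0.0, 0.25, 0.5])):
        print(f"shift={s:.2f}  rotation error={e:.4f} deg")
```

With zero shift the back-projected clouds differ by an exact rigid motion, so the rotation is recovered up to floating-point error. A nonzero shift moves each point along its viewing ray by a different amount in the two views, which no single scale can absorb, so the recovered rotation becomes biased.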