Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates

Shengjie Zhu, Ahmed Abdelkader, Mark J. Matthews, Xiaoming Liu, Wen-Sheng Chu

TL;DR

This work shows that Monocular Depth Estimation (MDE) depth maps are accurate enough to yield SoTA or competitive results in SfM and camera relocalization tasks. It proposes Marginalized Bundle Adjustment (MBA), which mitigates the high error variance of MDE by leveraging the density of its depth maps.

Abstract

Structure-from-Motion (SfM) is a fundamental 3D vision task for recovering camera parameters and scene geometry from multi-view images. While recent deep learning advances enable accurate Monocular Depth Estimation (MDE) from single images without depending on camera motion, integrating MDE into SfM remains a challenge. Unlike conventional triangulated sparse point clouds, MDE produces dense depth maps with significantly higher error variance. Inspired by modern RANSAC estimators, we propose Marginalized Bundle Adjustment (MBA) to mitigate MDE error variance leveraging its density. With MBA, we show that MDE depth maps are sufficiently accurate to yield SoTA or competitive results in SfM and camera relocalization tasks. Through extensive evaluations, we demonstrate consistently robust performance across varying scales, ranging from few-frame setups to large multi-view systems with thousands of images. Our method highlights the significant potential of MDE in multi-view 3D vision.

Paper Structure (20 sections, 14 equations, 5 figures, 15 tables)

Figures (5)

  • Figure 1: Marginalized Bundle Adjustment (MBA). Our method registers monocular depth maps into a consistent 3D coordinate system. Red camera icons indicate the viewpoints of registered depth maps. Monocular depth provides a strong structural prior, yet its predictions are inherently high-variance, as reflected by the noisy appearance of the reconstructed point cloud. This makes classical Bundle Adjustment, which is designed for accurate sparse point clouds, ill-suited. We therefore introduce a RANSAC-inspired Bundle Adjustment objective that leverages the depth maps' density to robustly accommodate their variance. Although depth-derived point clouds have lower visual fidelity, our experiments show that monocular depth already supports SoTA or competitive performance on diverse SfM benchmarks, as exemplified by ScanNet dai2017scannet, IMC2021 bi2021method, and ETH3D bi2021method. This highlights the significant potential of monocular depth models for multi-view vision tasks.
  • Figure 2: System Overview. Given $N$ RGB images, the system consumes dense depth maps and pairwise correspondences inferred by pretrained models. It outputs intrinsics, extrinsics, and frame-wise affine depth correction scalars. A sparse $N\times N$ pose graph is built from co-visible frames using the correspondences. Dense inputs are subsampled into a data matrix of size $| \mathcal{E}| \times \kappa \times 5$, where $| \mathcal{E}|$ is the number of graph edges, for scalable multi-GPU optimization. After initialization, Bundle Adjustment (BA) proceeds from coarse to fine. In the coarse stage, the BA objective is evaluated and summed over the "star-shaped" subgraph $\mathcal{G}_i$ of each frame $i$. Each subgraph contains frame $i$ plus its co-visible neighbors, shown as one colored row in the coarse pose graph. The fine stage evaluates the objective over the full graph. The BA runs gradient descent for a fixed number of iterations.
  • Figure 3: CDF and PDF of the empirical residual distribution $\mathcal{R}$. In Eqn. \ref{eqn:binary_scoring_funciton_thresholds}, our BA maximizes the area under $\mathcal{R}$'s CDF curve up to a maximum threshold. The BA thereby automatically forms a smoothed categorization of inliers versus outliers. The forward and backward computations (Eqn. \ref{eqn:fine_forward} and \ref{eqn:fine_backward}) at a residual $r$ reduce to looking up $F(r)$ and $p(r)$ on the CDF and PDF curves, respectively.
  • Figure 4: Dense depth and correspondence are the inputs to our multi-view pose estimation system. We benchmark pose performance with depth maps under diverse imaging conditions. Panels (a) to (e) show ScanNet ScanNet_Dai_2017_CVPR indoor images, T&T knapitsch2017tanks and ETH3D Schops_2017_CVPR high-resolution images, IMC2021 bi2021method internet-collected images, and a Wayspots Brachmann_2023_CVPR flipped image. Panel (f) visualizes dense correspondence.
  • Figure 5: Camera Re-localization on 7-Scenes and Wayspots. Green and blue mark the predicted and ground-truth odometry, respectively. We present challenging sequences of repetitive, textureless images. The images exhibit (1) scale changes, (2) flipping, and (3) a lack of distinguishable depth references. Consequently, the model predicts sub-optimal depth maps. Despite these significant challenges, the depth maps still support accurate camera poses.
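
The Figure 3 caption describes the BA objective as the area under the empirical residual CDF up to a maximum threshold. The sketch below illustrates that idea only; it is not the authors' implementation, and the paper's actual equations are not reproduced here. It assumes non-negative residuals and uses hypothetical names (`cdf_auc_numeric`, `cdf_auc_closed_form`, `tau_max`). It relies on the standard identity that integrating the inlier ratio $F(\tau)$ over thresholds $\tau \in [0, \tau_{max}]$ equals the mean of $\max(\tau_{max} - r, 0)$, so averaging a binary RANSAC-style inlier count over all thresholds yields a smooth score per residual.

```python
import numpy as np

def cdf_auc_numeric(residuals, tau_max, num_thresholds=20001):
    """Integrate the empirical CDF F(tau) over [0, tau_max] with the trapezoidal rule."""
    r = np.sort(np.asarray(residuals, dtype=float))
    taus = np.linspace(0.0, tau_max, num_thresholds)
    # F(tau) = fraction of residuals <= tau, i.e. the inlier ratio at threshold tau.
    cdf = np.searchsorted(r, taus, side="right") / r.size
    dt = taus[1] - taus[0]
    return float(np.sum(0.5 * (cdf[1:] + cdf[:-1])) * dt)

def cdf_auc_closed_form(residuals, tau_max):
    """Same area in closed form: mean of max(tau_max - r, 0), valid for r >= 0."""
    r = np.asarray(residuals, dtype=float)
    return float(np.mean(np.clip(tau_max - r, 0.0, None)))

# Synthetic residuals (an assumption for illustration): a noisy inlier bulk
# plus 20% gross outliers that all lie beyond tau_max.
rng = np.random.default_rng(0)
residuals = np.abs(rng.normal(0.0, 1.0, size=2000))
residuals[:400] += rng.uniform(5.0, 50.0, size=400)
tau_max = 3.0

auc_numeric = cdf_auc_numeric(residuals, tau_max)
auc_closed = cdf_auc_closed_form(residuals, tau_max)

# Pushing the gross outliers ten times further away leaves the score unchanged,
# because residuals beyond tau_max contribute zero area.
worse = residuals.copy()
worse[:400] *= 10.0
auc_worse = cdf_auc_closed_form(worse, tau_max)

print(auc_numeric, auc_closed, auc_worse)
```

In this sketch, residuals beyond `tau_max` have no influence on the score, while residuals inside it are weighted by how many thresholds would count them as inliers. This is one way to read the caption's "smoothed categorization of inliers versus outliers".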