Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction

Wenyu Li, Sidun Liu, Peng Qiao, Yong Dou

TL;DR

Mono3R addresses the limitations of matching-based multi-view reconstruction in textureless or occluded regions by integrating monocular geometric priors into a dual-branch framework. It combines a DUSt3R-style pairwise branch with a monocular MoGe-based branch and a monocular-cues guided refinement module that performs global Sim(3) alignment and iterative refinement to produce robust dense point maps. Across five benchmarks, Mono3R achieves significant improvements in both multi-view pose estimation and dense reconstruction, particularly in challenging indoor scenes where previous methods faltered. The work demonstrates that monocular priors can efficiently bolster multi-view consistency and detail recovery, offering a practical path toward more reliable 3D reconstruction in real-world conditions.

Abstract

Recent advances in data-driven geometric multi-view 3D reconstruction foundation models (e.g., DUSt3R) have shown remarkable performance across various 3D vision tasks, facilitated by the release of large-scale, high-quality 3D datasets. However, as we observed, constrained by their matching-based principles, the reconstruction quality of existing models suffers significant degradation in challenging regions with limited matching cues, particularly in weakly textured areas and low-light conditions. To mitigate these limitations, we propose to harness the inherent robustness of monocular geometry estimation to compensate for the inherent shortcomings of matching-based methods. Specifically, we introduce a monocular-guided refinement module that integrates monocular geometric priors into multi-view reconstruction frameworks. This integration substantially enhances the robustness of multi-view reconstruction systems, leading to high-quality feed-forward reconstructions. Comprehensive experiments across multiple benchmarks demonstrate that our method achieves substantial improvements in both multi-view camera pose estimation and point cloud accuracy.

Paper Structure

This paper contains 19 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: In this paper, we reveal the limitations of DUSt3R in reconstructing textureless regions and fine structures, as demonstrated in the 2nd row. This fundamentally stems from matching-based approaches, where matching consistency proves difficult to maintain in such challenging areas. To address these limitations, we propose Mono3R, which integrates the robustness of monocular geometry estimation into DUSt3R. Our method can both reconstruct accurate geometry in textureless regions and recover fine structural details, as shown in the 3rd row.
  • Figure 2: Our framework consists of two complementary branches and a refinement module. The pairwise branch processes image pairs through feature matching to simultaneously extract cross-image feature correspondences and regress 3D point clouds. The monocular branch processes individual images to extract view-specific geometric information. The mono-guided refinement module first performs global Sim(3) alignment to establish a unified coordinate system for the monocular outputs, then iteratively optimizes the pairwise reconstruction for improved accuracy.
  • Figure 3: Comparison between aligned monocular pointmaps $\{M_{i}\}$ and pairwise pointmaps $\{P_i^0\}$. Although the monocular pointmaps have undergone global alignment with the predictions from the pairwise branch, the aligned results still exhibit severe discrepancies.
  • Figure 4: Qualitative comparison of our predicted depthmaps and 3D points to DUSt3R on in-the-wild captured images. Colored camera frustums illustrate the estimated camera poses. As shown in the top row, our method successfully predicts the thin tubular structure of metal pipes, while DUSt3R predicts a significantly distorted structure. In the second row, our method robustly recovers the flat door structure from two images with repeated textures, while DUSt3R generates false depth discontinuities that violate the planar surface prior.
  • Figure 5: Qualitative examples of Mono3R’s output.
  • ...and 2 more figures
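
The Figure 2 caption mentions a global Sim(3) alignment that brings the monocular outputs into a unified coordinate system. This summary does not specify how the paper computes that alignment, so the following is only a minimal sketch of one standard way to fit a similarity transform between two corresponding point sets: the closed-form Umeyama least-squares solution. The function names `umeyama_sim3` and `apply_sim3` are hypothetical, and the sketch assumes equal-weight, one-to-one point correspondences, which may differ from the method actually used in Mono3R.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Closed-form least-squares Sim(3) fit (Umeyama, 1991).

    Finds scale s, rotation R (3x3) and translation t (3,) minimizing
    sum_i || dst_i - (s * R @ src_i + t) ||^2 for (N, 3) arrays src, dst.
    This is an illustrative stand-in, not the paper's confirmed procedure.
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    n = src.shape[0]

    # Center both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between target and source, then its SVD.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def apply_sim3(points, s, R, t):
    """Map (N, 3) points through the similarity transform x -> s * R @ x + t."""
    return s * np.asarray(points, dtype=np.float64) @ R.T + t
```

In a setting like the one the caption describes, `src` could be the flattened points of a monocular pointmap and `dst` the corresponding points predicted by the pairwise branch. Per-pixel confidence weighting and outlier rejection are omitted here for brevity.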