Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction
Wenyu Li, Sidun Liu, Peng Qiao, Yong Dou
TL;DR
Mono3R addresses the limitations of matching-based multi-view reconstruction in textureless or occluded regions by integrating monocular geometric priors into a dual-branch framework. It combines a DUSt3R-style pairwise branch with a MoGe-based monocular branch and a monocular-cue-guided refinement module that performs global Sim(3) alignment and iterative refinement to produce robust dense point maps. Across five benchmarks, Mono3R achieves significant improvements in both multi-view pose estimation and dense reconstruction, particularly in challenging indoor scenes where previous methods faltered. The work demonstrates that monocular priors can efficiently bolster multi-view consistency and detail recovery, offering a practical path toward more reliable 3D reconstruction in real-world conditions.
Abstract
Recent data-driven foundation models for geometric multi-view 3D reconstruction (e.g., DUSt3R) have shown remarkable performance across various 3D vision tasks, facilitated by the release of large-scale, high-quality 3D datasets. However, we observe that, constrained by their matching-based principles, existing models suffer significant degradation in reconstruction quality in challenging regions with limited matching cues, particularly weakly textured areas and low-light conditions. To mitigate these limitations, we propose to harness the robustness of monocular geometry estimation to compensate for the inherent shortcomings of matching-based methods. Specifically, we introduce a monocular-guided refinement module that integrates monocular geometric priors into multi-view reconstruction frameworks. This integration substantially enhances the robustness of multi-view reconstruction systems, leading to high-quality feed-forward reconstructions. Comprehensive experiments across multiple benchmarks demonstrate that our method achieves substantial improvements in both multi-view camera pose estimation and point cloud accuracy.
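The TL;DR mentions a global Sim(3) alignment between monocular and multi-view geometry. This summary does not specify how Mono3R computes it, so the sketch below is only an illustration of the general idea: it uses the classic closed-form Umeyama solution to find the scale, rotation, and translation that best map one point set onto another in the least-squares sense. The function name `umeyama_sim3` and the assumption of known point-to-point correspondences (e.g., the same pixels in a monocular and a multi-view point map) are hypothetical and not taken from the paper.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Closed-form least-squares Sim(3) fit so that dst ≈ scale * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points (assumed correspondences).
    Returns (scale, R, t) with R a proper 3x3 rotation and t a 3-vector.
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    n = src.shape[0]

    # Center both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between target and source, then its SVD.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a rotation, not a reflection.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    scale = float(np.sum(D * np.diag(S)) / var_src)
    t = mu_dst - scale * R @ mu_src
    return scale, R, t
```

In this setting, `src` could hold points from a scale-ambiguous monocular prediction and `dst` the corresponding points from the pairwise branch; applying `scale * src @ R.T + t` then brings the monocular points into the multi-view frame. A practical system would likely weight points by confidence and reject outliers, which this sketch omits.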
