Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
Haoyu Guo, He Zhu, Sida Peng, Haotong Lin, Yunzhi Yan, Tao Xie, Wenguan Wang, Xiaowei Zhou, Hujun Bao
TL;DR
This work tackles multi-view-consistent 3D reconstruction from monocular cues by integrating SfM priors into a diffusion-based depth estimator, producing high-quality depth maps without heavy multi-view matching. Murre conditions a latent diffusion model, initialized from a 2D foundation model, on a densified SfM point cloud, and aligns the predicted depths to the SfM scale before fusing them into 3D geometry. The approach reports superior depth and reconstruction performance on indoor, object-level, street, and aerial datasets, generalizes well from synthetic to real data, and includes ablations showing the benefits of SfM conditioning, depth normalization, and the flexibility to choose among SfM backends. By combining mature SfM priors with modern diffusion priors, the work offers a scalable path toward high-quality reconstruction in diverse environments; code and materials are publicly available.
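The step of aligning predicted depths to the SfM scale can be illustrated with a minimal sketch. The summary above does not specify the exact alignment procedure, so this example assumes a common choice: a least-squares fit of one scale and one shift per image over the pixels where sparse SfM depth is available. The function name `align_depth_to_sfm` and all variable names are hypothetical and not taken from the Murre code.

```python
import numpy as np


def align_depth_to_sfm(pred_depth, sfm_depth, valid_mask):
    """Fit scale s and shift t minimizing ||s * pred + t - sfm||^2 on valid pixels.

    Assumption: a per-image affine (scale + shift) alignment; the paper's
    actual procedure may differ.
    """
    x = pred_depth[valid_mask].astype(np.float64)
    y = sfm_depth[valid_mask].astype(np.float64)
    if x.size < 2:
        raise ValueError("need at least two valid SfM depths to fit scale and shift")
    # Design matrix [pred, 1] so that A @ [s, t] approximates the SfM depths.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred_depth + t, s, t


# Synthetic check: the "prediction" is the true depth under an unknown affine map.
rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 10.0, size=(4, 5))
pred = (true_depth - 0.5) / 2.0          # i.e. true = 2.0 * pred + 0.5
mask = np.zeros_like(true_depth, dtype=bool)
mask[::2, ::2] = True                    # sparse pixels with SfM depth
sparse_sfm = np.where(mask, true_depth, 0.0)

aligned, s, t = align_depth_to_sfm(pred, sparse_sfm, mask)
```

In this synthetic setup the fit recovers the affine map from only six sparse points, after which the aligned depth map is in the same metric frame as the SfM reconstruction and could be passed to a fusion step. With real SfM points, a robust variant (for example, outlier rejection before the fit) would likely be needed.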
Abstract
In this paper, we present a new method for multi-view geometric reconstruction. In recent years, large vision models have developed rapidly, performing excellently across various tasks and demonstrating remarkable generalization capabilities. Some works use large vision models for monocular depth estimation, and the resulting depth has been applied to facilitate multi-view reconstruction in an indirect manner. Due to the ambiguity of monocular depth estimation, the estimated depth values are usually not accurate enough, which limits their utility for multi-view reconstruction. We propose to incorporate SfM information, a strong multi-view prior, into the depth estimation process, thus enhancing the quality of depth prediction and enabling its direct application in multi-view geometric reconstruction. Experimental results on public real-world datasets show that our method significantly improves the quality of depth estimation compared to previous monocular depth estimation works. Additionally, we evaluate the reconstruction quality of our approach in various types of scenes, including indoor, streetscape, and aerial views, where it surpasses state-of-the-art MVS methods. The code and supplementary materials are available at https://zju3dv.github.io/murre/.
