Multi-view Reconstruction via SfM-guided Monocular Depth Estimation

Haoyu Guo, He Zhu, Sida Peng, Haotong Lin, Yunzhi Yan, Tao Xie, Wenguan Wang, Xiaowei Zhou, Hujun Bao

TL;DR

This work tackles the challenge of producing multi-view-consistent 3D reconstructions from monocular cues by integrating Structure from Motion (SfM) priors into a diffusion-based depth estimator, enabling high-quality depth maps without heavy multi-view matching. Murre uses a densified SfM point cloud as an explicit conditioning signal for a latent diffusion model initialized from a 2D foundation model, and aligns the predicted depths to the SfM scale before fusing them into 3D geometry. The approach demonstrates superior depth and reconstruction performance across indoor, object-level, street, and aerial datasets, with strong generalization from synthetic to real data. Ablations illustrate the benefits of SfM conditioning, depth normalization, and a flexible choice of SfM backend. The work advances practical 3D reconstruction by combining mature SfM priors with modern diffusion priors, offering a scalable path toward high-quality reconstructions in diverse environments, and is accompanied by publicly available code and materials.
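The alignment step mentioned above can be illustrated with a minimal sketch. It assumes a least-squares fit of one scale and one shift per image, computed only at pixels that carry an SfM depth; this summary does not state the exact alignment Murre uses, so the function name `align_depth_to_sfm` and its parameters are hypothetical and the fit is a common stand-in, not the authors' procedure.

```python
import numpy as np


def align_depth_to_sfm(pred_depth, sparse_depth, valid_mask):
    """Fit scale s and shift t so that s * pred + t matches SfM depths.

    Assumption: a global affine (scale + shift) model per image, solved in
    closed form by least squares over the SfM-supported pixels only.
    """
    pred = pred_depth[valid_mask].astype(np.float64)
    target = sparse_depth[valid_mask].astype(np.float64)
    if pred.size < 2:
        raise ValueError("need at least two SfM-supported pixels")

    # Design matrix [pred, 1] so the solution vector is (scale, shift).
    design = np.stack([pred, np.ones_like(pred)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, target, rcond=None)

    # Apply the fitted transform to the full dense prediction.
    return scale * pred_depth + shift, scale, shift
```

Once every depth map has been brought into the SfM coordinate scale this way, the maps are mutually comparable and can be passed to a fusion step.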

Abstract

In this paper, we present a new method for multi-view geometric reconstruction. In recent years, large vision models have developed rapidly, performing excellently across various tasks and demonstrating remarkable generalization capabilities. Some works use large vision models for monocular depth estimation, and the resulting depth maps have been applied to facilitate multi-view reconstruction in an indirect manner. Because monocular depth estimation is inherently ambiguous, the estimated depth values are usually not accurate enough, which limits their utility in aiding multi-view reconstruction. We propose to incorporate SfM information, a strong multi-view prior, into the depth estimation process, thus enhancing the quality of the predicted depths and enabling their direct application in multi-view geometric reconstruction. Experimental results on public real-world datasets show that our method significantly improves the quality of depth estimation compared to previous monocular depth estimation works. Additionally, we evaluate the reconstruction quality of our approach in various types of scenes including indoor, streetscape, and aerial views, surpassing state-of-the-art MVS methods. The code and supplementary materials are available at https://zju3dv.github.io/murre/.

Paper Structure

This paper contains 28 sections, 2 equations, 14 figures, and 7 tables.

Figures (14)

  • Figure 1: We propose Murre, a new method for multi-view 3D reconstruction based on SfM-guided monocular depth estimation. Building on Stable Diffusion [rombach2021highresolution], Murre demonstrates exceptional generalization capabilities after fine-tuning on a modest amount of synthetic data. Murre is capable of achieving high-quality reconstructions for a variety of real-world scenarios, including object-level, indoor, street, and aerial scenes.
  • Figure 2: Overview of our multi-view reconstruction pipeline. Given multi-view images, we first employ a Structure from Motion (SfM) method [he2024dfsfm] to derive sparse 3D scene structures (a). These 3D structures are then encoded into an intermediate explicit representation (b), which is used as a condition for depth estimation (c). Finally, we conduct TSDF fusion [newcombe2011kinectfusion] to achieve the final reconstruction (d).
  • Figure 3: Qualitative comparison on the DTU dataset. Except for DUSt3R, which directly infers the point cloud, all other methods use Fusibile [galliani2015massively] to fuse depth maps, filtering out points that are inconsistent across viewpoints. Murre produces a more complete final point cloud due to its relatively accurate, multi-view-consistent depth estimation. Please refer to the supplementary materials for results on other datasets.
  • Figure 4: Qualitative ablations of SfM methods. We conduct ablations on both texture-rich and textureless scenes.
  • Figure 5: Qualitative comparison of depth estimation on DTU [jensen2014large].
  • ...and 9 more figures
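
Steps (a) and (b) of the Figure 2 pipeline turn sparse SfM points into an image-space signal that a depth estimator can be conditioned on. The following is a minimal sketch of one plausible building block: projecting world-space SfM points into a single view to obtain a sparse depth map. It assumes a pinhole camera without skew or distortion and a world-to-camera pose `(R, t)`; the function name `sparse_depth_from_points` is hypothetical, and the densification that Murre applies afterwards is not described in this summary and is not shown.

```python
import numpy as np


def sparse_depth_from_points(points_world, K, R, t, height, width):
    """Project 3D points into one view and keep the nearest depth per pixel.

    Assumptions: K is a 3x3 pinhole intrinsic matrix with zero skew, and
    (R, t) maps world coordinates to camera coordinates. Pixels without
    any projected point are set to 0.
    """
    cam = points_world @ R.T + t              # (N, 3) camera coordinates
    z = cam[:, 2]
    in_front = z > 1e-6                       # drop points behind the camera
    cam, z = cam[in_front], z[in_front]

    uv = cam @ K.T                            # homogeneous pixel coordinates
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)

    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)           # nearest point wins per pixel
    depth[np.isinf(depth)] = 0.0              # mark empty pixels as invalid
    return depth
```

The nonzero pixels of the returned map double as a validity mask, so the same array can serve both as a conditioning input and as the reference for later scale alignment.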