Boosting Multi-View Stereo with Depth Foundation Model in the Absence of Real-World Labels

Jie Zhu, Bo Peng, Zhe Zhang, Bingzheng Liu, Jianjun Lei

TL;DR

DFM-MVS tackles the challenge of training multi-view stereo without real-world depth labels by leveraging a depth foundation model (Depth Anything V2) to generate a realistic depth prior. It introduces a depth prior-based pseudo-supervised training mechanism (DPPTM) and a depth prior-guided error correction strategy (DPECS) to supervise and stabilize a coarse-to-fine MVS network, respectively. Across DTU and Tanks & Temples, the approach achieves state-of-the-art performance among label-free methods and competes with, or surpasses, several label-supervised baselines, with ablations confirming the effectiveness of both DPPTM and DPECS. The work demonstrates the practical potential of depth foundation models for scalable, real-label-free 3D reconstruction, addressing both supervision quality and early-stage error propagation in MVS pipelines.

Abstract

Learning-based Multi-View Stereo (MVS) methods have made remarkable progress in recent years. However, effectively training the network without real-world labels remains a challenging problem. In this paper, driven by recent advances in vision foundation models, a novel method termed DFM-MVS is proposed, which leverages a depth foundation model to generate an effective depth prior and thereby boost MVS in the absence of real-world labels. Specifically, a depth prior-based pseudo-supervised training mechanism is developed to simulate realistic stereo correspondences using the generated depth prior, thereby constructing effective supervision for the MVS network. In addition, a depth prior-guided error correction strategy is presented, which uses the depth prior as guidance to mitigate the error propagation problem inherent in the widely used coarse-to-fine network structure. Experimental results on the DTU and Tanks & Temples datasets demonstrate that the proposed DFM-MVS significantly outperforms existing MVS methods that do not use real-world labels.
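
The abstract does not spell out how stereo correspondences are simulated from the depth prior. As a rough illustration of the underlying geometry only, and not the authors' DPPTM, the sketch below shows the standard way a per-pixel depth map induces correspondences between two calibrated views: back-project each reference pixel to 3D, move it into the source camera frame, and re-project. The function name `depth_induced_correspondences`, its parameters, and the assumption that the depth map is already in a scale consistent with the camera poses are all hypothetical choices made for this example.

```python
import numpy as np

def depth_induced_correspondences(depth, K_ref, K_src, R, t):
    """Map every reference pixel to the source view using a depth map.

    Illustrative assumption: `depth` stands in for a depth prior that is
    already expressed in the same scale as the translation `t`.

    depth        : (H, W) depth per reference pixel
    K_ref, K_src : (3, 3) camera intrinsics
    R, t         : (3, 3) rotation and (3,) translation taking
                   reference-camera coordinates to source-camera coordinates
    Returns an (H, W, 2) array of (x, y) pixel coordinates in the source view.
    """
    H, W = depth.shape
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates, one row per pixel.
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project: viewing ray through each pixel, scaled by its depth.
    rays = pix @ np.linalg.inv(K_ref).T
    pts_ref = rays * depth.reshape(-1, 1)
    # Rigid transform into the source camera frame.
    pts_src = pts_ref @ R.T + t
    # Perspective projection into the source image.
    proj = pts_src @ K_src.T
    uv = proj[:, :2] / proj[:, 2:3]
    return uv.reshape(H, W, 2)

# Toy usage: fronto-parallel plane at depth 2, source camera offset along x.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 5), 2.0)
uv = depth_induced_correspondences(depth, K, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

In this toy setup every pixel shifts horizontally by focal length × baseline / depth = 100 × 0.1 / 2 = 5 pixels, which is the familiar stereo disparity relation. How DFM-MVS actually turns such correspondences into supervision is described in the paper's method section, not here.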

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Performance comparison of state-of-the-art MVS methods without requiring real-world labels on (a) DTU (lower is better) and (b) Tanks & Temples (higher is better).
  • Figure 2: The overall architecture of the proposed DFM-MVS, which includes (a) the depth prior-based pseudo-supervised training mechanism (DPPTM) and (b) the depth prior-guided error correction strategy (DPECS).
  • Figure 3: Visual comparison of point clouds reconstructed by different methods on the DTU evaluation set. Rows one and three show the full reconstructed point clouds, while rows two and four show zoomed-in views of the red-outlined regions.
  • Figure 4: Error visualization of point clouds reconstructed by different methods on the Tanks & Temples benchmark dataset. Darker areas in the map indicate larger errors in the point cloud.
  • Figure 5: Visual comparison of point clouds reconstructed by different variants of the method on the DTU evaluation set.