Boosting Multi-View Stereo with Depth Foundation Model in the Absence of Real-World Labels
Jie Zhu, Bo Peng, Zhe Zhang, Bingzheng Liu, Jianjun Lei
TL;DR
DFM-MVS tackles the challenge of training multi-view stereo without real-world depth labels by using a depth foundation model (Depth Anything V2) to generate a realistic depth prior. It introduces two components for a coarse-to-fine MVS network: a depth prior-based pseudo-supervised training mechanism (DPPTM), which supplies supervision, and a depth prior-guided error correction strategy (DPECS), which stabilizes the coarse-to-fine estimation. On DTU and Tanks & Temples, the approach achieves state-of-the-art performance among label-free methods and matches or surpasses several label-supervised baselines, with ablations confirming the effectiveness of both DPPTM and DPECS. The work demonstrates the practical potential of depth foundation models for scalable 3D reconstruction without real-world labels, addressing both supervision quality and early-stage error propagation in MVS pipelines.
Abstract
Learning-based Multi-View Stereo (MVS) methods have made remarkable progress in recent years. However, effectively training the network without real-world labels remains a challenging problem. In this paper, driven by recent advances in vision foundation models, a novel method termed DFM-MVS is proposed that leverages a depth foundation model to generate an effective depth prior, so as to boost MVS in the absence of real-world labels. Specifically, a depth prior-based pseudo-supervised training mechanism is developed to simulate realistic stereo correspondences using the generated depth prior, thereby constructing effective supervision for the MVS network. In addition, a depth prior-guided error correction strategy is presented that uses the depth prior as guidance to mitigate the error propagation problem inherent in the widely used coarse-to-fine network structure. Experimental results on the DTU and Tanks & Temples datasets demonstrate that the proposed DFM-MVS significantly outperforms existing MVS methods that do not use real-world labels.
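To make the idea of simulating stereo correspondences from a depth prior more concrete, the following is a minimal sketch of standard pinhole reprojection: a per-pixel depth and a relative camera pose together determine where each pixel lands in a second view. This is not the paper's DPPTM implementation, whose details the abstract does not give. The function names, the `(fx, fy, cx, cy)` intrinsics tuple, the virtual pose, and the treatment of the prior as depth in a consistent (possibly arbitrary) scale are all assumptions made for illustration.

```python
def reproject_pixel(u, v, depth, intrinsics, rotation, translation):
    """Map pixel (u, v) with known depth into a second (virtual) view.

    intrinsics is a hypothetical (fx, fy, cx, cy) tuple; rotation is a
    3x3 nested list and translation a 3-tuple taking source-camera
    coordinates to target-camera coordinates.
    Returns (u2, v2, depth2), or None if the point is behind the camera.
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the source camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the rigid transform into the target camera frame.
    xt, yt, zt = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zt <= 0:
        return None
    # Project back to pixel coordinates in the target view.
    return (fx * xt / zt + cx, fy * yt / zt + cy, zt)


def simulate_correspondences(depth_map, intrinsics, rotation, translation):
    """List (u, v, u2, v2) matches induced by a depth map and a pose.

    depth_map is a list of rows (height x width). Pixels with
    non-positive depth, or whose reprojection falls outside the image,
    are skipped.
    """
    height, width = len(depth_map), len(depth_map[0])
    matches = []
    for v in range(height):
        for u in range(width):
            d = depth_map[v][u]
            if d <= 0:
                continue
            result = reproject_pixel(u, v, d, intrinsics, rotation, translation)
            if result is None:
                continue
            u2, v2, _ = result
            if 0 <= u2 <= width - 1 and 0 <= v2 <= height - 1:
                matches.append((u, v, u2, v2))
    return matches


# Usage: a flat 4x4 depth prior and a small sideways camera shift.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
flat_prior = [[2.0] * 4 for _ in range(4)]
pairs = simulate_correspondences(
    flat_prior, (10.0, 10.0, 2.0, 2.0), IDENTITY, (0.1, 0.0, 0.0)
)
```

For a pure sideways translation the horizontal shift reduces to fx * tx / depth, so nearer pixels move farther. This depth-dependent displacement is the stereo cue that correspondences derived from a depth prior can carry.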
