Learning-based Multi-View Stereo: A Survey
Fangjinhua Wang, Qingtian Zhu, Di Chang, Quankai Gao, Junlin Han, Tong Zhang, Richard Hartley, Marc Pollefeys
TL;DR
This survey analyzes learning-based Multi-View Stereo (MVS) approaches by organizing them around depth-map, voxel, NeRF, 3D Gaussian Splatting, and large feed-forward representations, with a primary focus on depth-map-based methods for their conciseness, flexibility, and scalability. It dissects the end-to-end pipeline (camera calibration, view selection, plane-sweep depth estimation, cost volumes, and depth fusion) while contrasting online and offline paradigms and summarizing performance on standard benchmarks. The work also surveys unsupervised and semi-supervised variants, as well as non-depth representations, highlighting strengths, limitations, and practical trade-offs across accuracy, efficiency, and generalization. Finally, it outlines future directions, including richer datasets, pixel-level view selection, integration of priors, generative reconstruction, and efficiency improvements to advance real-world applicability of MVS systems.
Abstract
3D reconstruction aims to recover the dense 3D structure of a scene. It plays an essential role in various applications such as Augmented/Virtual Reality (AR/VR), autonomous driving, and robotics. Leveraging multiple views of a scene captured from different viewpoints, Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments. Due to its efficiency and effectiveness, MVS has become a pivotal method for image-based 3D reconstruction. Recently, with the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance compared with traditional methods. We categorize these learning-based methods as depth map-based, voxel-based, NeRF-based, 3D Gaussian Splatting-based, and large feed-forward methods. Among these, we focus primarily on depth map-based methods, which are the main family of MVS due to their conciseness, flexibility, and scalability. In this survey, we provide a comprehensive review of the literature at the time of this writing. We investigate these learning-based methods, summarize their performance on popular benchmarks, and discuss promising future research directions in this area.
