Learning-based Multi-View Stereo: A Survey

Fangjinhua Wang, Qingtian Zhu, Di Chang, Quankai Gao, Junlin Han, Tong Zhang, Richard Hartley, Marc Pollefeys

TL;DR

This survey analyzes learning-based Multi-View Stereo (MVS) approaches by organizing them around depth-map, voxel, NeRF, 3D Gaussian Splatting, and large feed-forward representations, with a primary focus on depth-map-based methods for their efficiency and scalability. It dissects the end-to-end pipeline—covering camera calibration, view selection, plane-sweep depth estimation, cost volumes, and depth fusion—while contrasting online and offline paradigms and evaluating performance on standard benchmarks. The work also surveys unsupervised and semi-supervised variants, as well as non-depth representations, highlighting strengths, limitations, and practical trade-offs across accuracy, efficiency, and generalization. Finally, it outlines future directions, including richer datasets, pixel-level view selection, integration of priors, generative reconstruction, and efficiency improvements to advance real-world applicability of MVS systems.

Abstract

3D reconstruction aims to recover the dense 3D structure of a scene. It plays an essential role in various applications such as Augmented/Virtual Reality (AR/VR), autonomous driving and robotics. Leveraging multiple views of a scene captured from different viewpoints, Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments. Due to its efficiency and effectiveness, MVS has become a pivotal method for image-based 3D reconstruction. Recently, with the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance compared with traditional methods. We categorize these learning-based methods as: depth map-based, voxel-based, NeRF-based, 3D Gaussian Splatting-based, and large feed-forward methods. Among these, we focus primarily on depth map-based methods, which are the main family of MVS due to their conciseness, flexibility and scalability. In this survey, we provide a comprehensive review of the literature at the time of this writing. We investigate these learning-based methods, summarize their performance on popular benchmarks, and discuss promising future research directions in this area.

Paper Structure (61 sections, 32 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: An overall illustration of both online and offline depth map-based MVS pipelines. Online MVS usually deals with sequential data, e.g., video, and employs TSDF volumes as an intermediate representation for mesh extraction. Given a full set of images, offline MVS holds the global information of the captured scene, and usually fuses estimated depth maps into a point cloud with filtering.
  • Figure 2: Taxonomy of Multi-View Stereo.
  • Figure 3: Pipeline of depth map-based MVS, which usually consists of feature extraction, cost volume construction via plane sweep, cost volume regularization, depth estimation, and depth refinement.
  • Figure 4: Pipeline of Atlas [murez2020atlas]. 2D image features are back-projected into 3D volumes, which are aggregated and passed through a 3D CNN to directly regress a TSDF volume.
  • Figure 5: Pipeline of NeRF [mildenhall2020nerf]. Given a 3D position and 2D viewing direction (a), an MLP produces the color and volume density (b). Volume rendering then composites these values into an image (c). The model is optimized by minimizing the rendering loss (d).
  • ...and 5 more figures
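
The volume rendering step mentioned in the Figure 5 caption can be illustrated with a minimal sketch. It assumes the standard discrete quadrature used in NeRF-style rendering, where each sample along a ray contributes its color weighted by its opacity and by the transmittance accumulated before it. The function name `composite_ray` and its argument names are hypothetical and are not taken from the survey; in a real system the densities and colors would come from the MLP rather than being passed in by hand.

```python
import math


def composite_ray(sigmas, colors, deltas):
    """Composite samples along one ray (illustrative sketch, not the survey's code).

    sigmas: per-sample volume densities
    colors: per-sample RGB triples
    deltas: distances between consecutive samples
    Returns the rendered RGB value and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed so far
    rgb = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(sigmas, colors, deltas):
        # Opacity contributed by this sample over its segment of the ray.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for channel in range(3):
            rgb[channel] += weight * color[channel]
        # Light remaining for the samples behind this one.
        transmittance *= 1.0 - alpha
    return rgb, weights


if __name__ == "__main__":
    # Empty space, then an effectively opaque green sample, then a blue one behind it.
    rgb, weights = composite_ray(
        sigmas=[0.0, 1e9, 5.0],
        colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
        deltas=[1.0, 1.0, 1.0],
    )
    print(rgb, weights)
```

In this toy ray the opaque second sample absorbs all remaining light, so the sample behind it receives zero weight. The same weights can also be used to compute an expected depth along the ray, which is one way rendering-based methods relate to depth estimation.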