MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, Ziwei Liu

TL;DR

MVSGaussian introduces a fast, generalizable Gaussian Splatting approach by leveraging MVS-derived depth to create pixel-aligned Gaussian representations. It adds a depth-aware volume-rendering pathway and a multi-view geometric-consistent aggregation scheme to initialize fast per-scene optimization. The method achieves state-of-the-art generalization across DTU, Real Forward-facing, NeRF Synthetic, and Tanks and Temples, while delivering real-time rendering and dramatically reduced fine-tuning costs. Compared with prior generalizable NeRFs and vanilla 3D-GS, MVSGaussian offers superior view synthesis with faster, more scalable training and inference.

Abstract

We present MVSGaussian, a new generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS) that can efficiently reconstruct unseen scenes. Specifically, 1) we leverage MVS to encode geometry-aware Gaussian representations and decode them into Gaussian parameters. 2) To further enhance performance, we propose a hybrid Gaussian rendering that integrates an efficient volume rendering design for novel view synthesis. 3) To support fast fine-tuning for specific scenes, we introduce a multi-view geometric consistent aggregation strategy to effectively aggregate the point clouds generated by the generalizable model, serving as the initialization for per-scene optimization. Compared with previous generalizable NeRF-based methods, which typically require minutes of fine-tuning and seconds of rendering per image, MVSGaussian achieves real-time rendering with better synthesis quality for each scene. Compared with the vanilla 3D-GS, MVSGaussian achieves better view synthesis with less training computational cost. Extensive experiments on DTU, Real Forward-facing, NeRF Synthetic, and Tanks and Temples datasets validate that MVSGaussian attains state-of-the-art performance with convincing generalizability, real-time rendering speed, and fast per-scene optimization.

Paper Structure

This paper contains 18 sections, 13 equations, 11 figures, 16 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Comparison with existing methods. (a) We present the generalizable results on the Real Forward-facing (LLFF) dataset. Compared with other competitors, our method achieves better performance at a faster inference speed. (b) The results after per-scene optimization, where circle size represents optimization time. Our method achieves the best performance in just 45 seconds. (c) We illustrate a scene ("room"), showcasing the (PSNR/optimization time) of synthesized views, with "-" indicating results from direct inference using the generalizable model.
  • Figure 2: Overview of MVSGaussian. We first extract features $\{f_i\}_{i=1}^N$ from input source views $\{I_i\}_{i=1}^N$ using FPN. These features are then aggregated into a cost volume, regularized by 3D CNNs to produce depth. Subsequently, for each 3D point at the estimated depth, we use a pooling network to aggregate warped source features, obtaining the aggregated feature $f_v$. This feature is then enhanced using a 2D UNet, yielding the enhanced feature $f_g$. $f_g$ is decoded into Gaussian parameters for splatting, while $f_v$ is decoded into volume density and radiance for depth-aware volume rendering. Finally, the two rendered images are averaged to produce the final rendered result.
  • Figure 3: Consistent aggregation. With depth maps and point clouds produced by the generalizable model, we first conduct geometric consistency checks on depths to derive masks for filtering out unreliable points. The filtered point clouds are then concatenated to obtain a point cloud, serving as the initialization for per-scene optimization.
  • Figure 4: Qualitative comparison of rendering quality under generalization and 3-view settings with state-of-the-art methods (MVSNeRF, ENeRF, and MatchNeRF).
  • Figure 5: Qualitative comparison of rendering quality with state-of-the-art methods (MVSNeRF, ENeRF, and 3D-GS) after per-scene optimization.
  • ...and 6 more figures
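
The consistent aggregation in Figure 3 is described only at a high level in the caption. The following is a minimal sketch of one way such a filter could work, assuming the forward-backward reprojection test commonly used in MVS pipelines: a pixel is kept if, after being projected into another view and back through that view's depth, it lands close to where it started and at a similar depth. The `Camera` class, the thresholds `px_thresh`, `rel_thresh`, and `min_views`, and the nearest-neighbour depth lookup are all illustrative assumptions, not the authors' implementation or settings.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]
DepthMap = List[List[float]]  # indexed as depth[v][u]

IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


@dataclass
class Camera:
    """Hypothetical pinhole camera with pose x_cam = R @ x_world + t."""
    fx: float
    fy: float
    cx: float
    cy: float
    R: Mat3
    t: Vec3

    def unproject(self, u: float, v: float, depth: float) -> Vec3:
        """Lift pixel (u, v) at the given depth to a world-space point."""
        c = ((u - self.cx) / self.fx * depth - self.t[0],
             (v - self.cy) / self.fy * depth - self.t[1],
             depth - self.t[2])
        # R is orthonormal, so its inverse is its transpose.
        return tuple(sum(self.R[k][i] * c[k] for k in range(3)) for i in range(3))

    def project(self, p: Vec3) -> Tuple[float, float, float]:
        """Map a world point to (u, v, depth); u and v are inf behind the camera."""
        x, y, z = (sum(self.R[i][k] * p[k] for k in range(3)) + self.t[i]
                   for i in range(3))
        if z <= 0:
            return math.inf, math.inf, z
        return self.fx * x / z + self.cx, self.fy * y / z + self.cy, z


def is_consistent(u: int, v: int, d_ref: float, ref_cam: Camera,
                  src_depth: DepthMap, src_cam: Camera,
                  px_thresh: float, rel_thresh: float) -> bool:
    """Forward-backward check of one reference pixel against one source view."""
    if d_ref <= 0:
        return False
    us, vs, zs = src_cam.project(ref_cam.unproject(u, v, d_ref))
    if zs <= 0:
        return False
    ui, vi = math.floor(us + 0.5), math.floor(vs + 0.5)  # nearest source pixel
    if not (0 <= vi < len(src_depth) and 0 <= ui < len(src_depth[0])):
        return False
    d_src = src_depth[vi][ui]
    if d_src <= 0:
        return False
    # Reproject the source-view estimate back into the reference view.
    ur, vr, zr = ref_cam.project(src_cam.unproject(us, vs, d_src))
    if zr <= 0:
        return False
    pixel_err = math.hypot(ur - u, vr - v)
    depth_err = abs(zr - d_ref) / d_ref
    return pixel_err < px_thresh and depth_err < rel_thresh


def fuse_point_clouds(depths: Sequence[DepthMap], cams: Sequence[Camera],
                      px_thresh: float = 1.0, rel_thresh: float = 0.01,
                      min_views: int = 1) -> List[Vec3]:
    """Mask each view against the others, then concatenate surviving points."""
    points: List[Vec3] = []
    for i, (depth, cam) in enumerate(zip(depths, cams)):
        for v, row in enumerate(depth):
            for u, d in enumerate(row):
                votes = sum(
                    is_consistent(u, v, d, cam, depths[j], cams[j],
                                  px_thresh, rel_thresh)
                    for j in range(len(cams)) if j != i)
                if votes >= min_views:
                    points.append(cam.unproject(u, v, d))
    return points
```

Calling `fuse_point_clouds(depths, cams)` with one depth map and one camera per source view returns a single list of world-space points that could seed per-scene optimization. Raising `min_views` or tightening the thresholds trades point density for reliability; a practical implementation would vectorize the loops and sample depth bilinearly.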