MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction

Yaopeng Lou, Liao Shen, Tianqi Liu, Jiaqi Li, Zihao Huang, Huiqiang Sun, Zhiguo Cao

TL;DR

MuGS tackles generalizable novel view synthesis across varying input baselines by fusing multi-view stereo and monocular depth cues within a 3D Gaussian splatting framework. A projection-and-sampling depth fusion step refines the depth probability volume, guided by a learned consistency cue and enhanced by a lightweight attention mechanism, while a reference-view loss provides additional supervision from the source views. The method achieves state-of-the-art or competitive results on small- and large-baseline datasets and shows promising zero-shot performance, all without per-scene optimization. It combines robust MDE features with precise MVS depth guidance to obtain fast rendering and reliable geometry, and ablations confirm the contributions of depth refinement, feature augmentation, and reference supervision.

Abstract

We present Multi-Baseline Gaussian Splatting (MuGS), a generalized feed-forward approach for novel view synthesis that effectively handles diverse baseline settings, including sparse input views with both small and large baselines. Specifically, we integrate features from Multi-View Stereo (MVS) and Monocular Depth Estimation (MDE) to enhance feature representations for generalizable reconstruction. Next, we propose a projection-and-sampling mechanism for deep depth fusion, which constructs a fine probability volume to guide the regression of the feature map. Furthermore, we introduce a reference-view loss to improve geometry and optimization efficiency. We leverage 3D Gaussian representations to accelerate training and inference while enhancing rendering quality. MuGS achieves state-of-the-art performance across multiple baseline settings and diverse scenarios ranging from simple objects (DTU) to complex indoor and outdoor scenes (RealEstate10K). We also demonstrate promising zero-shot performance on the LLFF and Mip-NeRF 360 datasets. Code is available at https://github.com/EuclidLou/MuGS.
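
To make the projection-and-sampling idea concrete, the following is a minimal sketch for a single target pixel and a single depth hypothesis. It is not the authors' implementation: the function name `project_and_sample`, the pinhole intrinsics tuple `(fx, fy, cx, cy)`, the target-to-source pose `(R, t)`, and the nearest-neighbor lookup are all assumptions made for illustration. The projected depth and the sampled monocular depth are only returned here; in the paper they are passed on to a network.

```python
def project_and_sample(pixel, depth, K_tgt, K_src, R, t, src_depth_map):
    """Return (d_p, d_s) for one target pixel at one depth hypothesis.

    d_p: depth of the hypothesized 3D point in the source camera frame.
    d_s: source depth map value sampled at the projected pixel.
    Returns None if the point is behind the source camera or out of bounds.
    All names and conventions here are illustrative assumptions.
    """
    u, v = pixel
    fx, fy, cx, cy = K_tgt
    # Back-project the target pixel to a 3D point in the target camera frame.
    point = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Move the point into the source camera frame: X_src = R @ X_tgt + t.
    src = [sum(R[r][c] * point[c] for c in range(3)) + t[r] for r in range(3)]
    d_p = src[2]
    if d_p <= 0:
        return None
    # Perspective projection into the source image.
    fx_s, fy_s, cx_s, cy_s = K_src
    col = round(fx_s * src[0] / d_p + cx_s)
    row = round(fy_s * src[1] / d_p + cy_s)
    height, width = len(src_depth_map), len(src_depth_map[0])
    if not (0 <= row < height and 0 <= col < width):
        return None
    # Nearest-neighbor sampling (bilinear sampling is the more common choice).
    d_s = src_depth_map[row][col]
    return d_p, d_s
```

Repeating this for every pixel, depth hypothesis, and source view would give the pairs of projected and sampled depths that a learned module can compare, which matters because monocular depth is not guaranteed to share the scale of the MVS hypotheses.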

Paper Structure

This paper contains 14 sections, 11 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: MuGS achieves the best performance in both large- and small-baseline settings.
  • Figure 2: MuGS can generalize across different baselines.
  • Figure 4: Overview. Given input images ${\{I_i\}}_{i=1}^N$, we first extract multi-view image features from both the monocular encoder and the MVS cross-view encoder. The MVS features are used to regress a target-view depth probability volume, while the monocular features are decoded into source-view depth maps ${\{\mathcal{D}_i\}}_{i=1}^N$. By projecting the points of the depth probability volume onto the depth map $\mathcal{D}_i$ and sampling from it, we obtain $d^p_i$ and $d^s_i$, which are then fed into a U-Net to query for a refined probability volume $\mathbb{V}^p_{fine}$. In addition, both sets of features are concatenated to construct the feature volume. Next, we calculate the expected value of depth and feature using $\mathbb{V}^p_{fine}$, which produces the target depth and feature map. These are used to predict Gaussian parameters. Finally, the target-view image and the source reference views are rendered, and both contribute to the total loss.
  • Figure 5: 2-view small-baseline results on the DTU dataset (Jensen et al., 2014). Our method renders higher quality with fewer errors than other small-baseline methods.
  • Figure 6: 2-view large-baseline results on the RealEstate10K dataset. The images rendered by our method exhibit superior geometric accuracy and reduced distortion.
  • ...and 4 more figures
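
The Figure 4 caption describes taking the expected value of depth and feature under the refined probability volume. A minimal per-pixel sketch of that expectation is shown below, assuming the volume stores one non-negative weight per depth hypothesis; the function name `expected_depth_and_feature` and the list-based layout are illustrative assumptions, not the released code.

```python
def expected_depth_and_feature(probs, depths, feats):
    """Probability-weighted depth and feature for a single pixel.

    probs:  K non-negative weights, one per depth hypothesis.
    depths: K depth hypotheses.
    feats:  K feature vectors (equal length), one per hypothesis.
    The layout and names are assumptions made for illustration.
    """
    total = sum(probs)
    # Normalize defensively so the weights form a distribution.
    weights = [p / total for p in probs]
    depth = sum(w * d for w, d in zip(weights, depths))
    dim = len(feats[0])
    feature = [sum(w * f[c] for w, f in zip(weights, feats)) for c in range(dim)]
    return depth, feature


# Usage: three hypotheses with most of the mass on the middle one.
depth, feature = expected_depth_and_feature(
    [0.25, 0.5, 0.25],
    [1.0, 2.0, 3.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
)
```

Because the output is a weighted sum rather than an argmax, it stays differentiable with respect to the probabilities, which is the usual reason for regressing depth this way in MVS-style pipelines.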