Stereo-GS: Multi-View Stereo Vision Model for Generalizable 3D Gaussian Splatting Reconstruction

Xiufeng Huang, Ka Chun Cheung, Runmin Cong, Simon See, Renjie Wan

TL;DR

Stereo-GS tackles the high resource demands of generalizable 3D Gaussian Splatting by disentangling geometry and appearance within a multi-view stereo framework. It leverages a diffusion-generated set of multi-view images and a stereo vision backbone to produce dense multi-view feature tokens, which are fused with global attention to predict geometry as point-maps and appearance as Gaussian features, forming GS-maps. A two-stage training scheme and a refinement network reduce reliance on data priors, enabling pose-free, robust 3DGS reconstruction with improved training and inference efficiency. Experiments across multi-view and single-image-to-3D tasks show state-of-the-art quality with practical resource usage, and the approach demonstrates applicability to real-world scenes and faster turnaround times.

Abstract

Generalizable 3D Gaussian Splatting reconstruction showcases advanced Image-to-3D content creation but requires substantial computational resources and large datasets, posing challenges to training models from scratch. Current methods usually entangle the prediction of 3D Gaussian geometry and appearance, which makes them rely heavily on data-driven priors and results in slow regression speeds. To address this, we propose Stereo-GS, a disentangled framework for efficient 3D Gaussian prediction. Our method extracts features from local image pairs using a stereo vision backbone and fuses them via global attention blocks. Dedicated point and Gaussian prediction heads generate multi-view point-maps for geometry and Gaussian features for appearance, which are combined as GS-maps to represent the 3DGS object. A refinement network enhances these GS-maps for high-quality reconstruction. Unlike existing methods that depend on camera parameters, our approach achieves pose-free 3D reconstruction, improving robustness and practicality. By reducing resource demands while maintaining high-quality outputs, Stereo-GS provides an efficient, scalable solution for real-world 3D content generation.
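
The combination of point-maps and Gaussian features into GS-maps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the 4-view 128 x 128 setup, and the 11-channel appearance layout (opacity 1, scale 3, rotation quaternion 4, RGB 3) are assumptions chosen for illustration, and random arrays stand in for the outputs of the two prediction heads.

```python
import numpy as np


def assemble_gs_maps(point_maps, gaussian_feats):
    """Concatenate per-pixel geometry and appearance into GS-maps.

    point_maps:     (V, H, W, 3) per-pixel 3D positions, one map per view.
    gaussian_feats: (V, H, W, C) per-pixel appearance features.
    Returns an array of shape (V, H, W, 3 + C).
    """
    if point_maps.shape[-1] != 3:
        raise ValueError("point_maps must have 3 channels (x, y, z)")
    if point_maps.shape[:-1] != gaussian_feats.shape[:-1]:
        raise ValueError("views and spatial sizes must match")
    return np.concatenate([point_maps, gaussian_feats], axis=-1)


def gs_maps_to_gaussians(gs_maps):
    """Flatten GS-maps so that every pixel of every view is one Gaussian."""
    return gs_maps.reshape(-1, gs_maps.shape[-1])


# Hypothetical sizes: 4 input views, 128 x 128 maps, 11 appearance channels
# (assumed layout: opacity 1 + scale 3 + quaternion 4 + RGB 3).
rng = np.random.default_rng(0)
V, H, W, C = 4, 128, 128, 11

# Random stand-ins for the point head and Gaussian head outputs.
point_maps = rng.standard_normal((V, H, W, 3)).astype(np.float32)
gaussian_feats = rng.standard_normal((V, H, W, C)).astype(np.float32)

gs_maps = assemble_gs_maps(point_maps, gaussian_feats)
gaussians = gs_maps_to_gaussians(gs_maps)

print(gs_maps.shape)    # (4, 128, 128, 14)
print(gaussians.shape)  # (65536, 14)
```

Under these assumptions the number of Gaussians equals views x height x width, so halving the map resolution from 256 to 128 reduces the Gaussian count by a factor of four.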

Paper Structure

This paper contains 31 sections, 8 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Our proposed Stereo-GS generates multi-view GS-maps in a disentangled manner for predicting 3DGS geometry and appearance, enabling high-quality 3D Gaussian reconstruction. It first uses a stereo vision model to extract local feature tokens from image pairs, which are fused via multi-view global attention blocks. A point prediction head estimates geometry through multi-view point-maps, while a Gaussian prediction head generates Gaussian features for appearance. These are combined into GS-maps representing the 3DGS object, refined by a cross-view attention-based network, and rendered as per-pixel 3D Gaussians for novel views during training.
  • Figure 2: Multi-view reconstruction. Given the same multi-view inputs, standard 3DGS (Kerbl et al., 2023) fails to render novel-view images, and SplatterImage struggles to reconstruct the 3D objects from multiple views. Although LGM (Tang et al., 2024) reconstructs finer geometric structures and appearance details, it still has difficulty maintaining consistency and avoiding artifacts. Our method generates both high-quality geometry and appearance for the 3D objects.
  • Figure 3: Single Image-to-3D generation. Our method generates the 3D object with better visual quality and more consistent geometry than the baseline methods.
  • Figure 4: Single Image-to-3D generation with 4 input views and 6 input views. We use V3D (Chen et al., 2024) to generate multi-view images. The input image is scaled up for better visualization and positioned in the upper-left corner of each group of input views.
  • Figure 5: Training time efficiency. Comparison of model training time when predicting Gaussian features at output resolutions of (a) $256 \times 256$ and (b) $128 \times 128$, where the resolution determines the output 3D Gaussians used for rendering.
  • ...and 4 more figures