Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting

Xiangyu Sun, Runnan Chen, Mingming Gong, Dong Xu, Tongliang Liu

TL;DR

This work tackles sparse-view novel view synthesis by using priors from vision foundation models to compensate for missing information. It combines a dense, non-redundant Gaussian initialization built from a dense multi-view stereo model (DUSt3R) with depth priors from both stereo and monocular predictions, and adds diffusion-based appearance refinement for unseen views. The key contributions are (1) a dense, redundancy-free initialization strategy, (2) depth regularization using both training and pseudo views, (3) diffusion-guided multi-view appearance refinement, and (4) extensive ablations and state-of-the-art results on LLFF, DTU, and Tanks and Temples. Overall, Intern-GS improves rendering quality in sparse-view scenarios, including texture-sparse and large-scale scenes, with practical computation times.

Abstract

Sparse-view scene reconstruction often faces significant challenges due to the constraints imposed by limited observational data. These limitations result in incomplete information, leading to suboptimal reconstructions using existing methodologies. To address this, we present Intern-GS, a novel approach that leverages rich prior knowledge from vision foundation models to enhance sparse-view Gaussian Splatting, thereby enabling high-quality scene reconstruction. Specifically, Intern-GS uses vision foundation models to guide both the initialization and the optimization of 3D Gaussian Splatting, addressing the limitations of sparse inputs. In the initialization process, our method employs DUSt3R to generate a dense and non-redundant Gaussian point cloud. This approach significantly alleviates the limitations of traditional structure-from-motion (SfM) methods, which often struggle under sparse-view constraints. During the optimization process, vision foundation models predict depth and appearance for unobserved views, refining the 3D Gaussians to compensate for missing information in unseen regions. Extensive experiments demonstrate that Intern-GS achieves state-of-the-art rendering quality across diverse datasets, including both forward-facing and large-scale scenes, such as LLFF, DTU, and Tanks and Temples.
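The abstract says that predicted depth is used to refine the Gaussians, but this section does not state the exact loss. A monocular depth prediction is generally only reliable up to an unknown scale and shift, so one common way to compare it with a rendered depth map is a correlation-based loss. The sketch below is an assumption for illustration, not the paper's confirmed formulation; the function name `pearson_depth_loss` and its parameters are hypothetical.

```python
import numpy as np

def pearson_depth_loss(rendered_depth, prior_depth, eps=1e-8):
    """Return 1 - Pearson correlation between two depth maps.

    Hypothetical helper (not taken from the paper). The value is near 0
    when the two maps agree up to a positive scale and a shift, and
    approaches 2 when they are perfectly anti-correlated.
    """
    r = np.asarray(rendered_depth, dtype=np.float64).ravel()
    p = np.asarray(prior_depth, dtype=np.float64).ravel()
    # Center both maps so that a constant shift has no effect.
    r = r - r.mean()
    p = p - p.mean()
    # Normalizing by both norms removes the effect of a positive scale.
    corr = (r * p).sum() / (np.sqrt((r * r).sum() * (p * p).sum()) + eps)
    return 1.0 - corr
```

Because the loss ignores global scale and shift, it could in principle be applied to both training views and pseudo views without first aligning the monocular prediction to the scene scale.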

Paper Structure

This paper contains 42 sections, 19 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Comparison with the state-of-the-art SparseNeRF and SparseGS using 3 training views. Our work leverages a multi-view stereo prior to densely initialize the 3D Gaussians, which are then supervised with a combination of regularization terms. The reconstruction results show that our method improves rendering quality, yielding more refined and detailed results.
  • Figure 2: Comparison of point cloud initialization between the original 3D Gaussian Splatting (3DGS) and our method in 4 scenes under 3 training views. The first row shows results from SfM, which is used by the original 3DGS and most NeRF-based methods. The second row shows the results of our initialization method. Our method outperforms SfM in texture-poor areas.
  • Figure 3: In our framework, we first use a multi-view stereo model to predict point maps. This technique recovers point maps at a consistent scale, but fails to represent the scene well because of redundant points. To handle overlapping regions in the point maps, we design a Redundancy-Free (RF) algorithm that only initializes areas that are not yet well defined across all views. For the optimization process, we design a novel regularization method that jointly constrains the depth and color information of training and pseudo views. The color supervision is derived from the diffusion refinement model we employ, while the depth supervision comes from the multi-view stereo model and a monocular depth prediction model.
  • Figure 4: Results on the LLFF and DTU datasets using 3 training views. Our method captures more scene details, particularly in areas with sparse texture information. SparseNeRF struggles to synthesize accurate new views under sparse viewpoints, while SparseGS produces overly smooth views, losing many details.
  • Figure 5: Results on the Tanks and Temples dataset using 3 training views. SparseNeRF struggles to accurately represent structures. While SparseGS performs well overall, it tends to lose some texture information in areas with flat depth. In contrast, Intern-GS effectively captures these texture details.
  • ...and 4 more figures
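
The Figure 3 caption describes a Redundancy-Free (RF) step that only initializes areas not already covered by other views. The details of that algorithm are not given in this section, so the following is a minimal sketch of one plausible reading, under stated assumptions: per-view point maps are already expressed in a shared world frame, cameras are pinhole, and a pixel counts as "covered" if any previously merged point projects onto it. The names `project` and `redundancy_free_merge` are hypothetical and do not come from the paper or from DUSt3R.

```python
import numpy as np

def project(points_world, K, world_to_cam):
    """Project Nx3 world points with a pinhole camera; returns (uv, depth)."""
    ones = np.ones((len(points_world), 1))
    pts_h = np.concatenate([points_world, ones], axis=1)
    cam = (world_to_cam @ pts_h.T).T[:, :3]
    z = cam[:, 2]
    safe_z = np.where(z > 1e-8, z, 1.0)  # placeholder depth avoids division by zero
    uv = (K @ cam.T).T[:, :2] / safe_z[:, None]
    return uv, z

def redundancy_free_merge(point_maps, intrinsics, world_to_cams):
    """Merge HxWx3 point maps, skipping pixels already covered by earlier views.

    Assumed inputs (illustrative only): one point map, one 3x3 intrinsic
    matrix, and one 4x4 world-to-camera matrix per view.
    """
    h, w, _ = point_maps[0].shape
    merged = point_maps[0].reshape(-1, 3)  # the first view is kept in full
    for pmap, K, w2c in zip(point_maps[1:], intrinsics[1:], world_to_cams[1:]):
        uv, z = project(merged, K, w2c)
        cols = np.round(uv[:, 0]).astype(int)
        rows = np.round(uv[:, 1]).astype(int)
        # Keep only points in front of the camera that land inside the image.
        valid = (z > 1e-8) & (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
        covered = np.zeros((h, w), dtype=bool)
        covered[rows[valid], cols[valid]] = True
        # Add points only from pixels that no earlier point projects onto.
        merged = np.concatenate([merged, pmap[~covered]], axis=0)
    return merged
```

This sketch ignores occlusion and depth consistency: a pixel is treated as covered even if the projected point lies far in front of or behind the surface seen in the new view. A more careful version would likely compare projected depth against the new view's own depth before discarding a pixel.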