SA-GS: Semantic-Aware Gaussian Splatting for Large Scene Reconstruction with Geometry Constrain

Butian Xiong, Xiaoyu Ye, Tze Ho Elden Tse, Kai Han, Shuguang Cui, Zhen Li

TL;DR

This work addresses large-scale 3D scene reconstruction by incorporating semantic information into Gaussian Splatting to mitigate fantasy-surface and inconsistency issues. It introduces SA-GS, a three-stage pipeline that (i) generates semantic masks and derives per-semantic-group target shapes guided by a geometric-complexity prior, (ii) enforces a soft, per-Gaussian regularizer during training, and (iii) uses a hierarchical probability-density sampling to extract a geometry-faithful point cloud. Key contributions include a frequency-based perplexity measure $\mathbf{P_j}$ that bounds per-semantic-group splat counts, a soft regularization loss $\mathcal{L}_{gc}$, and a density-based point extraction scheme with $\phi(x)$, all aimed at preserving semantic detail while controlling memory use. Experiments on GauUscene-based datasets and a campus dataset demonstrate significant improvements in geometric metrics over state-of-the-art Gaussian splat methods and competitive image-based render quality. The approach enables detailed semantic queries and improved geometry, with practical impact for large-scale scene understanding and downstream tasks, while noting limitations related to semantic consistency and reliance on external masks.
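
The frequency-based measure mentioned above can be illustrated with a small sketch. This summary does not give the exact definition of $\mathbf{P_j}$, so the code below is only one plausible reading: it treats geometric complexity as the share of non-DC spectral power above a cutoff frequency in a grayscale patch taken from one semantic region. The function names `dft2` and `high_frequency_ratio`, the `cutoff` parameter, and the 8x8 patches are all assumptions made for illustration, not the authors' formulation.

```python
import cmath
import math


def dft2(patch):
    """Naive 2D discrete Fourier transform of a small grayscale patch.

    `patch` is a list of equal-length rows. Written for clarity, not speed.
    """
    h, w = len(patch), len(patch[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * math.pi * (u * y / h + v * x / w)
                    acc += patch[y][x] * cmath.exp(1j * angle)
            out[u][v] = acc
    return out


def high_frequency_ratio(patch, cutoff=0.25):
    """Fraction of non-DC spectral power at radial frequency >= cutoff.

    Hypothetical complexity proxy; frequencies are in cycles per pixel.
    """
    h, w = len(patch), len(patch[0])
    spectrum = dft2(patch)
    high = 0.0
    total = 0.0
    for u in range(h):
        for v in range(w):
            if u == 0 and v == 0:
                continue  # skip the DC term (mean brightness)
            # Map DFT indices to signed frequencies in (-0.5, 0.5].
            fu = (u if u <= h // 2 else u - h) / h
            fv = (v if v <= w // 2 else v - w) / w
            power = abs(spectrum[u][v]) ** 2
            total += power
            if math.hypot(fu, fv) >= cutoff:
                high += power
    # A (nearly) flat patch has no meaningful spectrum; report zero complexity.
    return high / total if total > 1e-12 else 0.0


# Two synthetic 8x8 patches: a fine checkerboard and a slow horizontal wave.
checkerboard = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
slow_wave = [[math.cos(2.0 * math.pi * x / 8) for x in range(8)] for _ in range(8)]
detailed_score = high_frequency_ratio(checkerboard)  # close to 1
smooth_score = high_frequency_ratio(slow_wave)       # close to 0
```

Under this reading, a region with fine texture (such as vegetation) would score higher than a smooth one (such as a road), which is the kind of signal that could be used to budget more splats for detailed semantic groups.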

Abstract

With the emergence of Gaussian Splats, recent efforts have focused on large-scale scene geometric reconstruction. However, most of these efforts either concentrate on memory reduction or spatial space division, neglecting information in the semantic space. In this paper, we propose a novel method, named SA-GS, for fine-grained 3D geometry reconstruction using semantic-aware 3D Gaussian Splats. Specifically, we leverage prior information stored in large vision models such as SAM and DINO to generate semantic masks. We then introduce a geometric complexity measurement function to serve as soft regularization, guiding the shape of each Gaussian Splat within specific semantic areas. Additionally, we present a method that estimates the expected number of Gaussian Splats in different semantic areas, effectively providing a lower bound for Gaussian Splats in these areas. Subsequently, we extract the point cloud using a novel probability density-based extraction method, transforming Gaussian Splats into a point cloud crucial for downstream tasks. Our method also offers the potential for detailed semantic inquiries while maintaining high image-based reconstruction results. We provide extensive experiments on publicly available large-scale scene reconstruction datasets with highly accurate point clouds as ground truth and our novel dataset. Our results demonstrate the superiority of our method over current state-of-the-art Gaussian Splats reconstruction methods by a significant margin in terms of geometric-based measurement metrics. Code and additional results will soon be available on our project page.

Paper Structure (19 sections, 13 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Qualitative comparison between our method and other 3DGS-based methods. In the current study we propose a shape constraint, an alpha constraint, and point cloud extraction. The quantitative ablation is shown on the right-hand side of the figure.
  • Figure 2: Overview: The blue section of the figure illustrates common methods for reconstructing geometrically aligned Gaussian Splats. The input for all Gaussian Splatting methods includes a COLMAP initialization consisting of images, camera positions, and SfM sparse point clouds. The output will be a traditional representation such as a mesh or point cloud, as shown in the right blue box. During training, in addition to the common image rendering loss, most methods encourage all 3D Gaussians to form a disk-like shape, as seen in SuGaR and 2DGS. After several training iterations, or at the end of the training process, other methods select a hard threshold for the alpha value and use the remaining Gaussians for geometric reconstruction. However, these hard constraints often result in poorer reconstruction, as demonstrated in our experiments. Instead of encouraging all Gaussians to adopt the same shape, our method uses semantic information to control the shape in detail. We first produce semantic masks for each input image, then extract shape information for each semantic group, and use this information to locally control the shape of each Gaussian. Additionally, we provide an opacity field sampling method that can dynamically allocate the desired number of points and ignore defective reconstruction parts.
  • Figure 3: Explanation of the Fantasy-Surface Problem: The first row of this figure shows the results of using SuGaR to reconstruct the Campus and College scenes from GauUsceneV2. Many surfaces incorrectly model the lighting conditions due to complex effects, such as how glass reflects sunlight at different angles and how clouds block sunlight. These imaginary surfaces, which do not represent the true surface, are regarded as fantasy surfaces. Our method, shown in the bottom rows, largely alleviates this problem, as evident in the figure. Another major source of geometric error occurs at the edges of unbounded scenes. However, this issue is common to all methods due to the sparsity of images at the edges and is not the focus of our current work.
  • Figure 4: Explanation of Inconsistency problem. The semantic segmentation results are sometimes inconsistent with previous judgments. As shown in Figures (a) and (b), two tunnels are regarded as ground using GroundingSAM. However, in the images captured from a camera position immediately adjacent to them (Figures (c) and (d)), the left tunnel is not regarded as ground. This inconsistency between consecutive images is the primary cause of failure in naive reconstruction methods.
  • Figure 5: Method Overview: Our method pipeline consists of three main stages. Initially, we utilize the same input as vanilla Gaussian Splatting, but enhance it with semantic information extracted via Grounding SAM. Next, we assess the geometric complexity of each semantic group by calculating high-frequency power. Our geometric constraint is implemented through a soft regularization, facilitated by a semantic loss function. This guides the Gaussian shapes to match the expected shapes determined earlier. The rendering loss further refines the shape and attributes of the 3DGS, while the shape constraint, indicated by a negative sign, ensures alignment between rendered and real images. Controlling the shapes of different 3DGS is achieved by mapping their projected pixels onto the semantic map obtained earlier. Additionally, by reducing the number of low-opacity Gaussian splats to the expected count, we minimize GPU memory consumption during training. Finally, we offer a user-friendly point cloud extraction method via hierarchical probability density sampling. Initially, we create a multinomial distribution using the opacity values stored in each 3DGS. Then, based on user inputs and the multinomial distribution, we determine the number of points to sample from each Gaussian distribution. Detailed experimental results demonstrate significant improvements at each step, showcasing superior geometric reconstruction compared to current state-of-the-art methods.
  • ...and 1 more figure
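
The hierarchical probability density sampling described in the Figure 5 caption can be sketched in two levels: first a multinomial draw over Gaussians weighted by opacity, then a draw of that many points from each selected Gaussian. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: each Gaussian is simplified to a mean plus per-axis standard deviations (a diagonal covariance, ignoring rotation), and the names `allocate_points` and `sample_point_cloud` are hypothetical.

```python
import random


def allocate_points(opacities, total_points, rng):
    """Level 1: split a point budget across Gaussians by opacity.

    random.choices normalizes the weights, so the resulting counts follow
    a multinomial distribution over the opacity values.
    """
    counts = [0] * len(opacities)
    picks = rng.choices(range(len(opacities)), weights=opacities, k=total_points)
    for idx in picks:
        counts[idx] += 1
    return counts


def sample_point_cloud(means, stds, opacities, total_points, seed=0):
    """Level 2: draw the allocated number of points from each Gaussian.

    Assumes axis-aligned Gaussians (diagonal covariance) for simplicity.
    """
    rng = random.Random(seed)
    counts = allocate_points(opacities, total_points, rng)
    points = []
    for mean, std, n in zip(means, stds, counts):
        for _ in range(n):
            points.append(tuple(rng.gauss(m, s) for m, s in zip(mean, std)))
    return points, counts


# Toy scene: one confident splat and one nearly transparent splat.
means = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
stds = [(0.1, 0.1, 0.1), (0.1, 0.1, 0.1)]
opacities = [0.9, 0.1]
points, counts = sample_point_cloud(means, stds, opacities, total_points=1000)
```

Because the budget is allocated in proportion to opacity, low-opacity splats contribute few or no points, which matches the caption's description of ignoring defective reconstruction parts while letting the user choose the total number of points.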