
SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs

Guibiao Liao, Qing Li, Zhenyu Bao, Guoping Qiu, Kanglin Liu

TL;DR

SPC-GS tackles indoor open-world free-view synthesis from sparse inputs by combining Scene-layout-based Gaussian Initialization (SGI) with Semantic-Prompt Consistency (SPC) Regularization. SGI densifies initial Gaussian points via generated adjacent views and view-constrained densification to create a scene-layout prior, improving geometric and semantic learning. SPC uses SAM2-driven region masks and semantic prompts from training views to enforce 2D and 3D semantic consistency on pseudo views, addressing limited view supervision. Across Replica and ScanNet, SPC-GS achieves higher reconstruction quality and open-world segmentation accuracy, demonstrating robustness with different CLIP-based semantic supervision and clear gains over state-of-the-art methods in sparse-input indoor scenes.

Abstract

3D Gaussian Splatting-based indoor open-world free-view synthesis approaches have shown strong performance with dense input images. However, they perform poorly when confronted with sparse inputs, primarily due to the sparse distribution of Gaussian points and insufficient view supervision. To address these challenges, we propose SPC-GS, leveraging Scene-layout-based Gaussian Initialization (SGI) and Semantic-Prompt Consistency (SPC) Regularization for open-world free-view synthesis with sparse inputs. Specifically, SGI provides a dense, scene-layout-based Gaussian distribution by utilizing view-changed images generated from a video generation model and view-constrained Gaussian point densification. Additionally, SPC mitigates limited view supervision by employing semantic-prompt-based consistency constraints derived from SAM2. This approach leverages available semantics from training views, serving as instructive prompts, to optimize visually overlapping regions in novel views with 2D and 3D consistency constraints. Extensive experiments demonstrate the superior performance of SPC-GS across the Replica and ScanNet benchmarks. Notably, our SPC-GS achieves a 3.06 dB gain in PSNR for reconstruction quality and a 7.3% improvement in mIoU for open-world semantic segmentation.
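
The two headline numbers in the abstract are reported in PSNR and mIoU. As a reading aid, the sketch below shows the standard definitions of both metrics in Python with NumPy. It is not the authors' evaluation code: the function names `psnr` and `mean_iou`, the `max_val` parameter, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration.

```python
import numpy as np


def psnr(image_a, image_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images with the same shape."""
    diff = np.asarray(image_a, dtype=np.float64) - np.asarray(image_b, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt."""
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    ious = []
    for c in range(num_classes):
        union = np.logical_or(pred == c, gt == c).sum()
        if union == 0:
            continue  # assumption: ignore classes absent from both label maps
        inter = np.logical_and(pred == c, gt == c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))


# Toy inputs (made up for illustration, not taken from the paper).
rendered = np.full((4, 4), 0.1)
reference = np.zeros((4, 4))
example_psnr = psnr(rendered, reference)

pred_labels = np.array([0, 0, 1, 1])
gt_labels = np.array([0, 1, 1, 1])
example_miou = mean_iou(pred_labels, gt_labels, num_classes=2)
```

Because PSNR is logarithmic in the mean squared error, a 3.06 dB gain on a single image corresponds to reducing the MSE by a factor of about 10^0.306 ≈ 2.02. Benchmark figures averaged over many images need not follow this ratio exactly.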

Paper Structure

This paper contains 23 sections, 9 equations, 17 figures, 14 tables.

Figures (17)

  • Figure 1: Visual comparisons of open-world free-view synthesis from sparse inputs (12 training views). The previous 3DGS-based method, Gau-Grouping [gaugrouping], utilizes sparse SfM points for Gaussian initialization and limited training views for supervision, leading to inferior rendering results. In contrast, our approach leverages scene-layout-based points for instructive initialization and combines them with semantic-prompt consistency constraints (detailed in the SPC design section), yielding superior reconstruction and segmentation results.
  • Figure 2: Framework of SPC-GS. (a) We first generate adjacent images of each training image using the video generation model MotionCtrl. These generated images, combined with the original training images, produce denser initialized SfM points. These points are then optimized to create a scene-layout Gaussian distribution via Gaussian densification and outlier removal. (b) Building on the scene-layout Gaussian initialization, SPC leverages semantic information from training views as instructive semantic prompts to optimize adjacent rendered pseudo views, establishing semantic consistency constraints that enhance overall sparse-input semantic understanding of 3D scenes.
  • Figure 3: Illustration of Iterative Stochastic Prompting (ISP). During training, ISP randomly samples point coordinates ($\bigstar$) from the training view, serving as the point prompt for SAM2 to produce region masks (highlighted in blue) across training and pseudo views.
  • Figure 4: Visual reconstruction results on novel views. Our approach achieves superior global structure and photo-realistic details.
  • Figure 5: Visual open-world segmentation results on novel views. Our approach yields more accurate and complete results.
  • ...and 12 more figures
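
The Figure 3 caption describes Iterative Stochastic Prompting as randomly sampling point coordinates from a training view and passing them to SAM2 as point prompts. A minimal sketch of that sampling step is given below, assuming uniform sampling over pixels, which the caption does not specify. The names `sample_point_prompts`, `prompt_masks`, and `toy_segmenter` are hypothetical, and `toy_segmenter` is only a stand-in for a promptable segmenter such as SAM2, not its actual API.

```python
import numpy as np


def sample_point_prompts(height, width, num_points, rng):
    """Draw pixel coordinates uniformly at random; returns an (N, 2) array of (x, y)."""
    xs = rng.integers(0, width, size=num_points)
    ys = rng.integers(0, height, size=num_points)
    return np.stack([xs, ys], axis=1)


def prompt_masks(points, segment_fn):
    """Call a user-supplied segmenter once per point prompt and collect the masks."""
    return [segment_fn(int(x), int(y)) for x, y in points]


def toy_segmenter(x, y, height=4, width=6):
    """Stand-in segmenter: returns the left or right image half containing (x, y)."""
    mask = np.zeros((height, width), dtype=bool)
    if x < width // 2:
        mask[:, : width // 2] = True
    else:
        mask[:, width // 2 :] = True
    return mask


rng = np.random.default_rng(0)
points = sample_point_prompts(height=4, width=6, num_points=3, rng=rng)
masks = prompt_masks(points, toy_segmenter)
```

In a training loop, the sampling would be repeated at each iteration so that different regions are prompted over time. How the resulting masks are propagated to pseudo views is handled by the segmenter and is outside this sketch.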