SparseLGS: Sparse View Language Embedded Gaussian Splatting

Jun Hu, Zhang Chen, Zhong Li, Yi Xu, Juyong Zhang

TL;DR

SparseLGS tackles open-vocabulary 3D language field reconstruction from sparse, pose-free views. It combines three components: a learning-based dense stereo model (MASt3R) for robust pose and point-cloud initialization, a three-step multi-view semantic alignment that resolves cross-view inconsistencies, and a bijection between low-dimensional features and the original CLIP features that enables open-vocabulary queries without prohibitive storage. RGB supervision during semantic training constrains geometry, so the method recovers accurate 3D semantics from only 3-4 views and runs about 5$\times$ faster than prior methods given the same sparse input. By pairing explicit 3D Gaussian Splatting with compact semantic features and cross-view alignment, it makes fast language field reconstruction practical when only a few images are available.

Abstract

Recently, several studies have combined Gaussian Splatting to obtain scene representations with language embeddings for open-vocabulary 3D scene understanding. While these methods perform well, they essentially require very dense multi-view inputs, limiting their applicability in real-world scenarios. In this work, we propose SparseLGS to address the challenge of 3D scene understanding with pose-free and sparse view input images. Our method leverages a learning-based dense stereo model to handle pose-free and sparse inputs, and a three-step region matching approach to address the multi-view semantic inconsistency problem, which is especially important for sparse inputs. Different from directly learning high-dimensional CLIP features, we extract low-dimensional information and build bijections to avoid excessive learning and storage costs. We introduce a reconstruction loss during semantic training to improve Gaussian positions and shapes. To the best of our knowledge, we are the first to address the 3D semantic field problem with sparse pose-free inputs. Experimental results show that SparseLGS achieves comparable quality when reconstructing semantic fields with fewer inputs (3-4 views) compared to previous SOTA methods with dense input. Besides, when using the same sparse input, SparseLGS leads significantly in quality and heavily improves the computation speed (5$\times$ speedup). Project page: https://ustc3dv.github.io/SparseLGS
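The bijection idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the class `FeatureBijection`, the function `relevancy`, the nearest-code lookup, and the toy 3-dimensional codes and 4-dimensional stand-ins for CLIP features are all assumptions made for illustration. The sketch only shows the general pattern of storing a compact code per region, lifting a rendered code back to its original high-dimensional feature through a one-to-one table, and scoring it against a text embedding.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class FeatureBijection:
    """Hypothetical one-to-one table between low-dim codes and high-dim features."""

    def __init__(self, codes, features):
        codes = [tuple(c) for c in codes]
        features = [tuple(f) for f in features]
        if len(codes) != len(features):
            raise ValueError("need exactly one feature per code")
        # A bijection requires distinct entries on both sides.
        if len(set(codes)) != len(codes) or len(set(features)) != len(features):
            raise ValueError("codes and features must each be distinct")
        self._lift = dict(zip(codes, features))

    def lift(self, rendered_code):
        """Snap a (noisy) rendered code to the nearest stored code, return its feature."""
        nearest = min(self._lift, key=lambda c: math.dist(c, rendered_code))
        return self._lift[nearest]


def relevancy(table, rendered_code, text_embedding):
    """Score a rendered low-dim code against a text query in the high-dim space."""
    return cosine(table.lift(rendered_code), text_embedding)


# Toy data (assumed): two regions, 3-dim codes, 4-dim stand-ins for CLIP features.
table = FeatureBijection(
    codes=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    features=[(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)],
)
rendered = (0.9, 0.1, 0.05)      # a slightly noisy rendering of the second code
query = (0.0, 1.0, 0.0, 0.0)     # stand-in for a text embedding
score = relevancy(table, rendered, query)
```

Under these assumptions, only the short codes need to be attached to the scene representation, while the full-size features live once in the table, which is the storage saving the abstract alludes to.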

Paper Structure

This paper contains 33 sections, 8 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: We present the semantic renderings from sparse, pose-free inputs using our method and LangSplat [qin2023langsplat]. Our method outperforms LangSplat in both multi-view consistency and rendering quality, producing more accurate and visually coherent results.
  • Figure 2: Our approach SparseLGS is capable of generating high-quality language fields from pose-free sparse view inputs in just a few minutes. We first leverage SAM and CLIP to obtain object-wise semantic maps, then use a learning-based stereo model to derive camera poses and point clouds from sparse inputs. To address semantic inconsistencies across views, we employ a three-step multi-view semantic alignment strategy. To better integrate semantics with Gaussian Splatting, we establish a bijection between the original CLIP features and their dimensionality-reduced counterparts. During training, we incorporate RGB supervision to enhance the 3D consistency of our learned language field.
  • Figure 3: Open-vocabulary 3D object localization experiments on the LERF dataset. The black dashed box marks the ground-truth bounding box of the query object, while the red dots indicate the location predicted by each method.
  • Figure 4: Open-vocabulary 3D semantic segmentation on the LERF dataset.
  • Figure 5: Open-vocabulary 3D semantic segmentation on the 3D-OVS dataset.
  • ...and 4 more figures