LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding

Hao Li, Roy Qin, Zhengyu Zou, Diqi He, Bohan Li, Bingquan Dai, Dingwen Zhang, Junwei Han

TL;DR

LangSurf tackles open-vocabulary 3D scene understanding by aligning language features with object surfaces in a Gaussian-based 3D field. It introduces a Hierarchical-Context Awareness Module to inject global context and a three-stage Language-Embedded Training regime to co-optimize geometry and language features, producing surface-consistent semantic Gaussians. Empirical results on LERF and ScanNet show state-of-the-art gains for 2D and 3D open-vocabulary segmentation, and the approach enables practical 3D object removal and editing. Overall, LangSurf advances text-guided 3D perception by ensuring semantic language is tightly anchored to 3D surfaces, improving both recognition and manipulability in real-world scenes.

Abstract

Applying Gaussian Splatting to perception tasks for 3D scene understanding is becoming increasingly popular. Most existing works primarily focus on rendering 2D feature maps from novel viewpoints, which leads to an imprecise 3D language field with outlier language features and ultimately fails to align objects in 3D space. Because they extract features from masked images, these approaches also lack essential contextual information, leading to inaccurate feature representations. To this end, we propose a Language-Embedded Surface Field (LangSurf), which accurately aligns the 3D language field with the surfaces of objects, enabling precise 2D and 3D segmentation from text queries and broadening the range of downstream tasks such as removal and editing. The core of LangSurf is a joint training strategy that flattens the language Gaussians onto object surfaces using geometry supervision and contrastive losses to assign accurate language features to the Gaussians of objects. In addition, we introduce the Hierarchical-Context Awareness Module, which extracts features at the image level to capture contextual information and then performs hierarchical mask pooling, using masks segmented by SAM, to obtain fine-grained language features at different hierarchies. Extensive experiments on open-vocabulary 2D and 3D semantic segmentation demonstrate that LangSurf outperforms the previous state-of-the-art method LangSplat by a large margin. As shown in Fig. 1, our method can segment objects in 3D space, which improves its effectiveness in instance recognition, removal, and editing, as also supported by comprehensive experiments. Project page: https://langsurf.github.io
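
The hierarchical mask pooling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mask_pool` and `hierarchical_mask_pool`, the array shapes, the choice of a plain mean as the pooling operator, and the level name used in the usage note are all assumptions made for illustration. The idea shown is that features are first computed densely for the whole image (so they carry context) and are then averaged inside each SAM-style mask, once per hierarchy level.

```python
import numpy as np


def mask_pool(feature_map: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Replace the features inside each mask with their mean (assumed pooling).

    feature_map: (H, W, D) dense image-level features.
    masks:       (M, H, W) boolean masks for one hierarchy level.
    Returns an (H, W, D) array; pixels covered by no mask keep their features.
    """
    pooled = feature_map.copy()
    for mask in masks:
        if not mask.any():
            continue  # skip empty masks to avoid averaging over nothing
        # Mean is taken over the original features, not the partially pooled map.
        pooled[mask] = feature_map[mask].mean(axis=0)
    return pooled


def hierarchical_mask_pool(feature_map: np.ndarray, masks_per_level: dict) -> dict:
    """Apply mask pooling independently for each hierarchy level.

    masks_per_level maps a (hypothetical) level name to an (M, H, W) mask array.
    """
    return {
        level: mask_pool(feature_map, masks)
        for level, masks in masks_per_level.items()
    }
```

As a usage example, calling `hierarchical_mask_pool(features, {"coarse": coarse_masks})` returns one pooled feature map per level. Because pooling happens after image-level feature extraction, each mask's feature still reflects its surroundings, which is the contrast the abstract draws with extracting features from masked images.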

Paper Structure

This paper contains 27 sections, 8 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: We propose LangSurf, a model that aligns language features with object surfaces to enhance 3D scene understanding. The top left and right panels illustrate the qualitative differences between LangSurf and LangSplat in reconstructing the 3D language field from multiple viewpoints, demonstrating LangSurf's superior alignment of semantic features with object surfaces. The bottom row shows a variety of downstream applications enabled by LangSurf.
  • Figure 2: Overview of the proposed LangSurf. Given input views, we reconstruct a language-embedded surface field to enable 2D / 3D open-vocabulary segmentation as well as downstream tasks. Our pipeline contains two main steps: 1) the Hierarchical-Context Awareness Module extracts context-aware features at multiple hierarchies (Sec. \ref{sec:hierarchical}); 2) Language-Embedded Training uses a joint training strategy to construct the language-embedded surface field (Sec. \ref{sec:joint}).
  • Figure 3: 2D Qualitative Results on the LERF Dataset. Here we showcase two scenes (i.e., Teatime and Waldo Kitchen) with multiple text-query segmentation masks. On the left, we present the images alongside the queried texts. On the right, we display the rendered results of our method and other methods, along with the corresponding ground truth annotations.
  • Figure 4: 2D Qualitative Results on the ScanNet Dataset. Here we showcase two scenes (i.e., scene0085_00 and scene0617_00) with multiple text-query segmentation masks. The masks predicted by our method cover more complete regions and have sharper boundaries than those of other methods; for the "Table" prompt, for example, our mask even surpasses the GT mask.
  • Figure 5: 3D Qualitative Results on both the LERF and ScanNet Datasets. We compare our method with other GS-based methods (LangSplat, Gaussian Grouping). We show the queried point clouds with their activation scores and the meshes corresponding to the given text.
  • ...and 10 more figures
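
The text-query selection behind the "queried point clouds with activation scores" in Figure 5 can be sketched as follows. This is an illustrative assumption rather than the paper's procedure: it scores each Gaussian by cosine similarity between its language feature and a text embedding, then keeps those above a threshold. The function name `query_gaussians`, the default threshold of 0.5, and the use of cosine similarity are all hypothetical, and the text encoder that would produce `text_embedding` is not specified here.

```python
import numpy as np


def query_gaussians(gaussian_feats: np.ndarray,
                    text_embedding: np.ndarray,
                    threshold: float = 0.5):
    """Score Gaussians against a text query (assumed cosine-similarity scoring).

    gaussian_feats: (N, D) language feature per Gaussian.
    text_embedding: (D,) embedding of the text query.
    threshold:      hypothetical cut-off on the score.
    Returns (scores, selected): (N,) float scores and an (N,) boolean mask.
    """
    eps = 1e-8  # guards against division by zero for all-zero features
    g = gaussian_feats / (np.linalg.norm(gaussian_feats, axis=1, keepdims=True) + eps)
    t = text_embedding / (np.linalg.norm(text_embedding) + eps)
    scores = g @ t  # cosine similarity per Gaussian
    return scores, scores > threshold
```

Under these assumptions, the Gaussians whose mask entry is true form the queried 3D region, which could then be highlighted, removed, or edited. When language features sit on object surfaces, a simple threshold like this is less likely to pick up floating outliers.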