Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space

Hyunjee Lee, Youngsik Yun, Jeongmin Bae, Seoha Kim, Youngjung Uh

TL;DR

This work reframes open-vocabulary 3D understanding of radiance fields by performing segmentation directly in 3D space rather than producing 2D masks. It introduces a point-wise semantic supervision mechanism to learn a 3D language field, and then transfers this field to 3DGS to achieve real-time open-vocabulary rendering while maintaining accuracy. A new 3D semantic evaluation protocol evaluates both reconstructed geometry and semantics via mesh-based F1-scores, applicable to NeRF and 3DGS. The results demonstrate state-of-the-art 3D and 2D segmentation performance and establish the first real-time open-vocabulary rendering in this domain, enabling practical 3D scene understanding for robotics and immersive AI applications.

Abstract

Understanding the 3D semantics of a scene is a fundamental problem for various scenarios such as embodied agents. While NeRFs and 3DGS excel at novel-view synthesis, previous methods for understanding their semantics have been limited to incomplete 3D understanding: their segmentation results are rendered as 2D masks that do not represent the entire 3D space. To address this limitation, we redefine the problem to segment the 3D volume and propose the following methods for better 3D understanding. We directly supervise the 3D points to train the language embedding field, unlike previous methods that anchor supervision at 2D pixels. We transfer the learned language field to 3DGS, achieving the first real-time rendering speed without sacrificing training time or accuracy. Lastly, we introduce a 3D querying and evaluation protocol for assessing the reconstructed geometry and semantics together. Code, checkpoints, and annotations are available at the project page.
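
To make the idea of segmenting the 3D volume for a text query concrete, here is a minimal sketch. It assumes that each sampled 3D point already has a language embedding and that the text query has been embedded into the same space. The function names, the cosine-similarity score, and the fixed threshold are illustrative assumptions; the paper's actual querying procedure may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_points(point_embeddings, text_embedding, threshold=0.5):
    """Return one boolean per 3D point: True if the point matches the query.

    Hypothetical helper: a point is selected when the similarity between its
    embedding and the text embedding reaches `threshold` (an assumed value).
    """
    return [cosine(e, text_embedding) >= threshold for e in point_embeddings]


# Toy example with 2-D embeddings standing in for real language features.
points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
mask = segment_points(points, query, threshold=0.7)
```

The resulting mask is defined over 3D points rather than over pixels of a rendered view, which is the distinction the abstract draws against 2D-mask approaches.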

Paper Structure (33 sections, 5 equations, 11 figures, 13 tables)

Figures (11)

  • Figure 1: Previous works segment rendered 2D masks on rendered features to understand radiance fields. Instead, we reformulate the task to segment 3D volumes. Our approach significantly improves 3D understanding of radiance fields.
  • Figure 2: We propose 3D segmentation as a more practical problem setting, segmenting the 3D volume for a given text query (Section \ref{sec:querying}). We then propose a point-wise semantic loss to supervise the sampled point embeddings (Section \ref{sec:pploss}). Furthermore, the learned language fields can be transferred into 3DGS for faster rendering (Section \ref{sec:3dgs_lang}). Lastly, our 3D evaluation protocol measures 3D segmentation performance in both reconstructed geometry and semantics (Section \ref{sec:3dseg}).
  • Figure 3: Point-wise semantic loss supervises the language embeddings of sampled points directly in 3D space, ensuring precise semantics.
  • Figure 4: Transferring Ours-NeRF into 3DGS: We initialize 3DGS using the point cloud exported from our learned NeRF, then optimize the attributes of 3DGS except for position. The language features obtained by querying the language field at the Gaussian center positions are then transferred to 3DGS.
  • Figure 5: Comparison of 3D Evaluation: (a) Existing methods predict the labels at the ground truth point cloud. It is misleading when the language embeddings capture the object area while the reconstructed geometry does not cover that object area. (b) To address this problem, we extract 3D meshes from the segmented points of the reconstructed scene to measure the F1-score between the exported mesh and the ground truth mesh.
  • ...and 6 more figures
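
The mesh-based F1-score described in the Figure 5 caption can be illustrated with a small sketch. It assumes both the exported mesh and the ground truth mesh have already been sampled into 3D point sets, and it uses the common distance-thresholded definition of precision and recall. The function names and the threshold `tau` are hypothetical, and the paper's exact protocol may differ.

```python
import math


def _fraction_within(src, dst, tau):
    """Fraction of points in `src` lying within distance `tau` of any point in `dst`."""
    if not src:
        return 0.0
    hits = sum(1 for p in src if any(math.dist(p, q) <= tau for q in dst))
    return hits / len(src)


def f1_score(pred_points, gt_points, tau):
    """Distance-thresholded F1 between two point sets (brute force, O(N*M)).

    precision: share of predicted points close to the ground truth
    recall:    share of ground truth points close to the prediction
    """
    precision = _fraction_within(pred_points, gt_points, tau)
    recall = _fraction_within(gt_points, pred_points, tau)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy example: the prediction covers the object but adds one spurious point.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
score = f1_score(pred, gt, tau=0.1)  # precision 2/3, recall 1
```

Because precision is computed from the reconstructed geometry itself, a method whose embeddings are correct but whose geometry misses the object is penalized, which is the failure case panel (a) of Figure 5 points out. For realistic point counts, a spatial index such as a k-d tree would replace the brute-force search.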