Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space
Hyunjee Lee, Youngsik Yun, Jeongmin Bae, Seoha Kim, Youngjung Uh
TL;DR
This work reframes open-vocabulary 3D understanding of radiance fields by performing segmentation directly in 3D space rather than producing 2D masks. It introduces a point-wise semantic supervision mechanism to learn a 3D language field, then transfers this field to 3DGS to achieve real-time open-vocabulary rendering while maintaining accuracy. A new 3D semantic evaluation protocol assesses reconstructed geometry and semantics together via mesh-based F1-scores and applies to both NeRF and 3DGS. The results demonstrate state-of-the-art 3D and 2D segmentation performance and establish the first real-time open-vocabulary rendering in this domain, enabling practical 3D scene understanding for robotics and immersive AI applications.
Abstract
Understanding the 3D semantics of a scene is a fundamental problem for applications such as embodied agents. While NeRFs and 3DGS excel at novel-view synthesis, previous methods for understanding their semantics offer only incomplete 3D understanding: their segmentation results are rendered as 2D masks that do not represent the entire 3D space. To address this limitation, we redefine the problem as segmenting the 3D volume and propose the following methods for better 3D understanding. First, we directly supervise the 3D points to train the language embedding field, unlike previous methods that anchor supervision at 2D pixels. Second, we transfer the learned language field to 3DGS, achieving the first real-time rendering speed without sacrificing training time or accuracy. Lastly, we introduce a 3D querying and evaluation protocol for assessing the reconstructed geometry and semantics together. Code, checkpoints, and annotations are available at the project page.
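To make the idea of querying and scoring in 3D concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each 3D point carries a language embedding, that a text query is matched by cosine similarity against a fixed threshold, and that the F1-score is computed between point sets with a distance threshold. The function names `query_points` and `f1_score_3d`, the parameters `threshold` and `tau`, and the use of point sets in place of meshes are all illustrative assumptions.

```python
import numpy as np


def query_points(point_embeds, text_embed, threshold=0.5):
    """Select 3D points whose embedding is similar to a text embedding.

    Hypothetical helper: cosine similarity with a fixed threshold is an
    assumption, not the relevancy measure confirmed by the paper.
    """
    eps = 1e-8  # guards against division by zero for all-zero embeddings
    p = point_embeds / (np.linalg.norm(point_embeds, axis=1, keepdims=True) + eps)
    t = text_embed / (np.linalg.norm(text_embed) + eps)
    similarity = p @ t  # shape (N,), one cosine similarity per point
    return similarity > threshold


def f1_score_3d(pred_pts, gt_pts, tau=0.05):
    """F1-score between a predicted and a ground-truth 3D point set.

    Precision: fraction of predicted points within tau of any GT point.
    Recall: fraction of GT points within tau of any predicted point.
    Assumption: points stand in for samples taken from mesh surfaces.
    """
    if len(pred_pts) == 0 or len(gt_pts) == 0:
        return 0.0
    # Pairwise Euclidean distances, shape (num_pred, num_gt).
    dists = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    precision = float(np.mean(dists.min(axis=1) < tau))
    recall = float(np.mean(dists.min(axis=0) < tau))
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy usage: four points with 2-D embeddings and a query aligned with axis 0.
points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
embeds = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
mask = query_points(embeds, np.array([1.0, 0.0]))
score = f1_score_3d(points[mask], points[:2])
```

Because both the selection and the score operate on 3D points, a prediction that looks correct from one viewpoint but selects the wrong region of space is penalized, which a 2D mask metric would not capture. The brute-force pairwise distance matrix is only practical for small point sets; a spatial index would be the natural replacement at scale.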
