Laser: Efficient Language-Guided Segmentation in Neural Radiance Fields
Xingyu Miao, Haoran Duan, Yang Bai, Tejal Shah, Jun Song, Yang Long, Rajiv Ranjan, Ling Shao
TL;DR
Laser addresses open-vocabulary 3D semantic segmentation in neural radiance fields by distilling dense CLIP features directly into a NeRF, using an efficient adapter and a self-cross-training strategy to suppress noise. It introduces a low-rank transient query attention mechanism to reduce the computational burden of 3D attention, and uses a label volume to turn segmentation into a cross-view-consistent classification problem. A simplified text augmentation strategy and a lightweight text-guided refinement pipeline further mitigate ambiguity in the alignment between CLIP features and text. Empirically, Laser trains substantially faster than state-of-the-art methods and achieves improved or competitive segmentation accuracy across multiple open-vocabulary 3D datasets, while keeping inference costs reasonable.
Abstract
In this work, we propose a method that leverages CLIP feature distillation to achieve efficient, language-guided 3D segmentation. Unlike previous methods, which rely on multi-scale CLIP features and are limited by processing speed and storage requirements, our approach streamlines the workflow by directly and effectively distilling dense CLIP features, enabling precise text-driven segmentation of 3D scenes. To this end, we introduce an adapter module and mitigate the noise in dense CLIP feature distillation through a self-cross-training strategy. To improve accuracy at segmentation edges, we present a low-rank transient query attention mechanism. To keep segmentation consistent for similarly colored regions across viewpoints, we recast segmentation as a classification task through a label volume, which significantly improves consistency in color-similar areas. We also propose a simplified text augmentation strategy to alleviate ambiguity in the correspondence between CLIP features and text. Extensive experimental results show that our method surpasses current state-of-the-art methods in both training speed and performance. Our code is available at https://github.com/xingy038/Laser.git.
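As a rough illustration of how distilled CLIP-like features can drive text-based segmentation, the sketch below assigns each rendered pixel to the text prompt whose embedding is most similar under cosine similarity. This is not the Laser implementation: the adapter, self-cross-training, low-rank transient query attention, and label volume are all omitted. The function names (`normalize`, `cosine_similarity`, `segment_by_text`) and the toy 3-dimensional vectors are assumptions made for illustration only; real CLIP embeddings have hundreds of dimensions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def segment_by_text(pixel_features, text_embeddings):
    """Assign each pixel the index of its most similar text embedding."""
    labels = []
    for feat in pixel_features:
        scores = [cosine_similarity(feat, t) for t in text_embeddings]
        # Index of the highest-scoring prompt; ties resolve to the lowest index.
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Hypothetical toy data: two text prompts and three rendered pixel features.
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pixel_features = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.6, 0.5, 0.0]]

labels = segment_by_text(pixel_features, text_embeddings)
print(labels)  # [0, 1, 0]
```

Treating segmentation as a per-pixel argmax over prompt similarities is the simplest form of language-guided classification. In a radiance-field setting, the per-pixel features would come from volume-rendering a learned feature field rather than from a fixed list.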
