Laser: Efficient Language-Guided Segmentation in Neural Radiance Fields

Xingyu Miao, Haoran Duan, Yang Bai, Tejal Shah, Jun Song, Yang Long, Rajiv Ranjan, Ling Shao

TL;DR

Laser tackles the challenge of open-vocabulary 3D semantic segmentation in neural radiance fields by distilling dense CLIP features directly into NeRF with an efficient adapter and a self-cross-training strategy to suppress noise. It introduces a low-rank transient query attention to reduce the computational burden of 3D attention, and employs a label-volume mechanism to convert segmentation into a cross-view consistent classification problem. A simplified text augmentation and a light-weight text-guided refinement pipeline further mitigate ambiguity in CLIP-text alignment. Empirically, Laser achieves substantially faster training times and improved or competitive segmentation accuracy across multiple open-vocabulary 3D datasets compared to state-of-the-art methods, while keeping inference costs reasonable.

Abstract

In this work, we propose a method that leverages CLIP feature distillation, achieving efficient 3D segmentation through language guidance. Unlike previous methods that rely on multi-scale CLIP features and are limited by processing speed and storage requirements, our approach aims to streamline the workflow by directly and effectively distilling dense CLIP features, thereby achieving precise segmentation of 3D scenes using text. To achieve this, we introduce an adapter module and mitigate the noise issue in the dense CLIP feature distillation process through a self-cross-training strategy. Moreover, to enhance the accuracy of segmentation edges, this work presents a low-rank transient query attention mechanism. To ensure the consistency of segmentation for similar colors under different viewpoints, we convert the segmentation task into a classification task through label volume, which significantly improves the consistency of segmentation in color-similar areas. We also propose a simplified text augmentation strategy to alleviate the issue of ambiguity in the correspondence between CLIP features and text. Extensive experimental results show that our method surpasses current state-of-the-art technologies in both training speed and performance. Our code is available on: https://github.com/xingy038/Laser.git.

Paper Structure

This paper contains 25 sections, 23 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Workflows of existing methods and Ours: (a) The core of the LERF/3D-OVS process initially adopts a cutting strategy that subdivides the training images into patches of different sizes. These patches are then fed into the CLIP encoder to extract multi-scale CLIP features, which are subsequently saved. At the same time, the original image is also input into DINO to extract DINO features. Afterwards, these multi-scale CLIP features and DINO features participate together in the segmentation branch optimization process of NeRF. Such a process demands significantly high computational and storage resources. (b) Compared to previous methods, our process only requires the input of training images into a modified CLIP encoder, after which it can predict dense CLIP features. Utilizing these features, the segmentation branch of NeRF can be optimized efficiently, while significantly reducing the consumption of computational and storage resources.
  • Figure 2: Comparison of interaction schemes across modalities. M=modality, N=new modality. (a) Features from different modalities are fused and then processed interactively. When a new modality is included, retraining is required. (b) Features of different modalities interact directly. When a new modality is added, retraining is also required. (c) Features of different modalities interact directly. After a new modality is introduced, features from a similar modality can be used for feature distillation. (d) After introducing new modalities, we not only adopt feature distillation from similar modalities, but also directly process the interactive features between different modalities.
  • Figure 3: Modality Graphs. T=text feature, I=image feature, S=segmentation feature of NeRF, L=label volume feature of NeRF. (a) Previous methods only distilled the image modality capabilities of CLIP into the segmentation branch of NeRF. (b) and (c) demonstrate our attempt to align the segmentation modality of NeRF with the image modality of CLIP, as discussed in Section 3.1, where we introduced an adapter and a self-cross training strategy. (d) describes the self-enhancement of the NeRF segmentation modality, where in Section 3.2, we proposed a low-rank transient query attention. (e) By incorporating the label modality of NeRF, we achieved bilateral alignment among four modalities, as shown in Section 3.3, introducing label volume and $\mathcal{L}_{CE}$ loss. (f) pertains to the self-augmentation of the textual modality, where in Section 3.4, we proposed a simplified text augmentation strategy.
  • Figure 4: Employing label volume to generate cluster centroids. Our method progressively aggregates points lying on the same ray into a shared cluster centroid during training. This process groups 3D points with similar-looking features into the same category. As a result, 3D points whose features are close in appearance are associated with the same cluster, reinforcing the consistency of the categorization based on their color similarities.
  • Figure 5: Mitigating the ambiguity in CLIP features. We employ a simplified text augmentation strategy to standardize relevance maps. In the original relevance maps $Z_a$ and $Z_b$ in (a), the relevance of class a within the red-highlighted area is higher than in other image regions. Because the absolute relevance of class a is higher in this area, the ambiguity of CLIP features results in the red region being classified as class a, even though class b is also present. In (b), we reduce this ambiguity by simply repeating the text to recalculate the relevance maps $Z_a$ and $Z_b$, thereby enhancing the accuracy of regional class assignments. In (c), standardizing the relevance maps of each class to a fixed range can also reduce ambiguity. In (d), we combine text repetition with standardization of the relevance maps, significantly reducing the ambiguity in classification and leading to more precise regional class allocations.
  • ...and 6 more figures
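
The standardization step described in the Figure 5 caption can be illustrated with a small sketch. It is not taken from the Laser code: the function names `standardize` and `assign_classes`, the choice of min-max rescaling to [0, 1], and the toy relevance values are all assumptions made for illustration. The text-repetition step is omitted because it requires a CLIP text encoder. The sketch only shows why rescaling each class's relevance map to a fixed range can stop a class with higher absolute relevance from winning every pixel.

```python
def standardize(relevance, lo=0.0, hi=1.0):
    """Min-max rescale one class's relevance map to [lo, hi] (assumed scheme)."""
    flat = [v for row in relevance for v in row]
    vmin, vmax = min(flat), max(flat)
    span = vmax - vmin
    if span == 0:
        # A constant map carries no spatial preference; map it to the lower bound.
        return [[lo for _ in row] for row in relevance]
    return [[lo + (hi - lo) * (v - vmin) / span for v in row] for row in relevance]


def assign_classes(relevance_maps):
    """Label each pixel with the class whose relevance is highest there."""
    names = list(relevance_maps)
    height = len(relevance_maps[names[0]])
    width = len(relevance_maps[names[0]][0])
    return [
        [max(names, key=lambda n: relevance_maps[n][i][j]) for j in range(width)]
        for i in range(height)
    ]


# Toy 1x3 relevance maps (hypothetical values): class "a" has higher
# absolute relevance everywhere, although class "b" peaks in the middle.
Z = {
    "a": [[0.9, 0.6, 0.5]],
    "b": [[0.1, 0.5, 0.2]],
}

raw_labels = assign_classes(Z)
std_labels = assign_classes({name: standardize(m) for name, m in Z.items()})

print(raw_labels)  # class "a" wins every pixel on the raw maps
print(std_labels)  # after rescaling, "b" wins where it is relatively strongest
```

In this toy case the raw maps assign every pixel to class a, while the standardized maps hand the second and third pixels to class b, which mirrors the behaviour the caption describes for panel (c).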