FILP-3D: Enhancing 3D Few-shot Class-incremental Learning with Pre-trained Vision-Language Models

Wan Xu, Tianyu Huang, Tianyu Qu, Guanglei Yang, Yiwen Guo, Wangmeng Zuo

TL;DR

FILP-3D tackles 3D FSCIL by leveraging CLIP as a backbone to supply shape-related priors and mitigate domain gaps. It introduces two specialized modules, Redundant Feature Eliminator (RFE) and Spatial Noise Compensator (SNC), to align 3D and 2D representations and to recover geometric information from noisy scans, respectively. The authors also propose FSCIL3D-XL, an open benchmark with novel metrics such as $NCAcc$ and $F_{FSCIL}$ to better capture trade-offs between stability and plasticity. Across synthetic and real-world 3D benchmarks, FILP-3D achieves state-of-the-art results and demonstrates the value of vision-language priors for continual 3D learning, while providing a flexible, open evaluation platform for future work.

Abstract

Few-shot class-incremental learning (FSCIL) aims to mitigate catastrophic forgetting when a model is incrementally trained on limited data. However, many existing FSCIL methods make little effective use of prior knowledge, which leaves them unable to address the domain gap in 3D FSCIL and thus leads to catastrophic forgetting. The Contrastive Language-Image Pre-training (CLIP) model is a highly suitable backbone for 3D FSCIL because of its abundant shape-related prior knowledge. Unfortunately, applying it directly to 3D FSCIL runs into an incompatibility between 3D data representations and 2D features, which manifests primarily as feature space misalignment and significant noise. To address these challenges, we introduce the FILP-3D framework with two novel components: the Redundant Feature Eliminator (RFE) for feature space misalignment and the Spatial Noise Compensator (SNC) for significant noise. RFE aligns the feature spaces of input point clouds and their embeddings by performing a unique dimensionality reduction on the feature space of pre-trained models (PTMs), eliminating redundant information without compromising semantic integrity. SNC, in turn, is a graph-based 3D model designed to capture robust geometric information within point clouds, thereby compensating for the knowledge lost due to projection, particularly when processing real-world scanned data. Moreover, traditional accuracy metrics have been shown to be biased by the class imbalance in existing 3D datasets. We therefore propose the 3D FSCIL benchmark FSCIL3D-XL and novel evaluation metrics that offer a more nuanced assessment of a 3D FSCIL model. Experimental results on both established benchmarks and our proposed benchmark demonstrate that our approach significantly outperforms existing state-of-the-art methods.

Paper Structure (25 sections, 15 equations, 6 figures, 12 tables)

This paper contains 25 sections, 15 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: CLIP learns a large amount of prior knowledge from massive image-text pairs, so its pre-aligned image and text features contain sufficient shape-related prior knowledge. Combined with the elimination of redundant information (RFE) and the compensation of fine-grained 3D information (SNC), this significantly improves performance in 3D FSCIL.
  • Figure 2: Overview of FILP-3D. FILP-3D consists of three main components: the 3D branch (SNC), the CLIP backbone, and the classification component (RFE). The SNC generates the 3D feature $\mathbf{f}^{3D}$ from the input point cloud. The CLIP backbone generates the global 2D feature $\mathbf{f}^{2D}$ from the input point cloud and the text features $\mathbf{F}^t$ from the class names. The 3D feature $\mathbf{f}^{3D}$ and the global 2D feature $\mathbf{f}^{2D}$ are then fused into a global feature $\mathbf{f}^g$, which is used together with the text features to compute class probabilities after the RFE module has eliminated the redundant dimensions.
  • Figure 3: The feature space shown is spanned by three principal components, and each dimension's one-hot vector represents one principal component. The first two dimensions contain semantic information, while the third is a redundant component. Transforming the feature vector (green) into this feature space shows that projection eliminates the redundant information, whereas normalization improperly stretches the semantic information, leading to misclassification.
  • Figure 4: Overview of pre-processing
  • Figure 5: Visualization of experimental results.
  • ...and 1 more figure
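
The effect described in the Figure 3 caption can be illustrated with a small toy example. The sketch below is one reading of that caption, not the authors' implementation: the three-dimensional vectors, the class names `A` and `B`, and the helper functions `cosine`, `project`, and `classify` are all hypothetical, and the feature space is assumed to be already expressed in principal-component coordinates, with the third coordinate being the redundant one. It compares cosine-similarity classification in the full space (where normalization includes the redundant coordinate) with classification after the redundant coordinate has been projected away.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def project(v, keep):
    """Keep only the listed coordinates (drop the redundant ones)."""
    return [v[i] for i in keep]


def classify(feature, prototypes):
    """Return the class whose prototype is most similar to the feature."""
    return max(prototypes, key=lambda name: cosine(feature, prototypes[name]))


# Hypothetical class prototypes in principal-component coordinates.
# Dimensions 0 and 1 are assumed semantic; dimension 2 is assumed redundant.
prototypes = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.0, 1.0, 1.0],
}

# Hypothetical feature: semantically closer to A (0.8 vs 0.3),
# but carrying a large redundant component (0.9).
feature = [0.8, 0.3, 0.9]
semantic_dims = [0, 1]

# Full space: the redundant coordinate dominates the normalized comparison.
pred_full = classify(feature, prototypes)

# Projected space: only the semantic coordinates are compared.
projected_prototypes = {
    name: project(p, semantic_dims) for name, p in prototypes.items()
}
pred_projected = classify(project(feature, semantic_dims), projected_prototypes)

print(pred_full)       # B (misclassified because of the redundant component)
print(pred_projected)  # A (correct once the redundant dimension is removed)
```

In this toy setting the redundant coordinate alone is enough to flip the prediction, which is the failure mode the caption attributes to normalizing without first eliminating redundant dimensions.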