Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation

Zhaochong An, Guolei Sun, Yun Liu, Runjia Li, Min Wu, Ming-Ming Cheng, Ender Konukoglu, Serge Belongie

TL;DR

The paper tackles few-shot 3D point cloud semantic segmentation (FS-PCS) by introducing a multimodal setup that adds no annotation cost: it exploits textual class names explicitly and 2D images only implicitly, during pretraining. It presents MM-FSS, a multimodal FS-PCS model with two feature heads (an IF head for intermodal features and a UF head for unimodal features) and two fusion modules (MCF and MSF) that fuse intermodal, unimodal, and textual information, complemented by Test-time Adaptive Cross-modal Calibration (TACC). Training has two stages: 2D-aligned pretraining followed by episodic meta-learning. MM-FSS achieves consistent, significant gains over state-of-the-art methods on S3DIS and ScanNet, demonstrating the value of free modalities for few-shot 3D segmentation. The code is publicly available, which makes the approach reproducible and opens avenues for further research in multimodal few-shot perception.

Abstract

Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal annotated support samples. While existing FS-PCS methods have shown promise, they primarily focus on unimodal point cloud inputs, overlooking the potential benefits of leveraging multimodal information. In this paper, we address this gap by introducing a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. Under this easy-to-achieve setup, we present the MultiModal Few-Shot SegNet (MM-FSS), a model effectively harnessing complementary information from multiple modalities. MM-FSS employs a shared backbone with two heads to extract intermodal and unimodal visual features, and a pretrained text encoder to generate text embeddings. To fully exploit the multimodal information, we propose a Multimodal Correlation Fusion (MCF) module to generate multimodal correlations, and a Multimodal Semantic Fusion (MSF) module to refine the correlations using text-aware semantic guidance. Additionally, we propose a simple yet effective Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias, further improving generalization. Experimental results on S3DIS and ScanNet datasets demonstrate significant performance improvements achieved by our method. The efficacy of our approach indicates the benefits of leveraging commonly-ignored free modalities for FS-PCS, providing valuable insights for future research. The code is available at https://github.com/ZhaochongAn/Multimodality-3D-Few-Shot
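To make the few-shot setting concrete, the following is a minimal sketch of how an N-way K-shot episode might be drawn. It is not taken from the MM-FSS code base: the function name `sample_episode`, the `scenes_by_class` layout, and the choice of one held-out query scene per class are all assumptions made for illustration.

```python
import random


def sample_episode(scenes_by_class, n_way, k_shot, seed=None):
    """Draw one N-way K-shot episode (illustrative sketch only).

    scenes_by_class: dict mapping a class name to a list of scene ids that
    contain that class (hypothetical data layout).
    Returns (classes, support, query), where support maps each sampled class
    to k_shot scene ids and query maps each class to one held-out scene id.
    """
    rng = random.Random(seed)
    # Pick the N target classes for this episode.
    classes = rng.sample(sorted(scenes_by_class), n_way)
    support, query = {}, {}
    for name in classes:
        # Sample K support scenes plus one extra scene used as the query,
        # so the query never overlaps with the support set.
        picked = rng.sample(scenes_by_class[name], k_shot + 1)
        support[name] = picked[:k_shot]
        query[name] = picked[k_shot]
    return classes, support, query


# Toy scene ids; real episodes would reference annotated point clouds.
toy_scenes = {
    "sofa": ["scene_a", "scene_b", "scene_c"],
    "window": ["scene_d", "scene_e", "scene_f"],
}
classes, support, query = sample_episode(toy_scenes, n_way=1, k_shot=1, seed=0)
```

In the multimodal setup described above, each sampled class name would additionally be passed to the text encoder, which is why the textual modality comes at no extra annotation cost.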

Paper Structure

This paper contains 19 sections, 9 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Comparison between traditional unimodal FS-PCS and our proposed multimodal FS-PCS. Previous FS-PCS methods only make use of point clouds as unimodal input. In contrast, our proposed model utilizes multimodal information without additional annotation cost to improve FS-PCS by considering the textual modality of class names (explicit) and learning simulated features of the 2D modality (implicit). During meta-learning and inference, the 2D modality is not needed.
  • Figure 2: Overall architecture of the proposed MM-FSS. Given support and query point clouds, we first generate intermodal features $\mathbf{F}_{\rm s/q}^{\rm i}$ from the IF head and unimodal features $\mathbf{F}_{\rm s/q}^{\rm u}$ from the UF head. These features are then forwarded to the MCF module to generate initial multimodal correlations $\mathbf{C}_{\rm 0}$. Moreover, exploiting the alignment between intermodal features $\mathbf{F}_{\rm q}^{\rm i}$ and text embeddings $\mathbf{T}$, we use their affinity $\mathbf{G}_{\rm q}$ as the informative textual semantic guidance to refine the multimodal correlations in the MSF modules. Finally, we propose the TACC, a parameter-free module that adaptively calibrates predictions during test time to effectively mitigate the base bias issue. For clarity, we present the model under the 1-way 1-shot setting.
  • Figure 3: Qualitative comparison between COSeg and our proposed MM-FSS in the $1$-way $1$-shot setting on the S3DIS dataset. The target classes in the first and second rows are sofa and window, respectively. Colored circles highlight regions where predictions from COSeg and MM-FSS differ significantly to facilitate visual comparison.
  • Figure 4: Qualitative comparison of predictions from each head and our final prediction using TACC (Default) in the $1$-way $1$-shot setting on the S3DIS dataset. The target classes in the first and second rows are door and board, respectively.
  • Figure 5: Visualization of the effects of the weight $\mathbf{W}_{\rm q}$ between textual and visual modalities in \ref{eq:Wqsum}. The last column displays the heatmap of $\mathbf{W}_{\rm q}$ with the color bar referenced at the top. Higher values indicate larger weights assigned to textual guidance $\mathbf{G}_{\rm q}$. Each row represents the $1$-way $1$-shot setting on the S3DIS dataset targeting bookcase and table, respectively, arranged from top to bottom.
  • ...and 3 more figures
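The captions of Figures 2 and 5 refer to an affinity $\mathbf{G}_{\rm q}$ between query intermodal features and text embeddings, and to a per-point weight $\mathbf{W}_{\rm q}$ that balances textual against visual information. The sketch below illustrates both ideas on toy data. It rests on assumptions: cosine similarity is used as the affinity, and a per-point convex combination stands in for the weighting, since the paper's actual equation is not reproduced on this page. The function names `affinity` and `blend` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def affinity(point_feats, text_embeds):
    """Cosine similarity of every point feature with every text embedding.

    Assumption: cosine similarity is one plausible affinity, not necessarily
    the paper's definition. Returns a (num_points x num_classes) nested list.
    """
    pts = [l2_normalize(p) for p in point_feats]
    txt = [l2_normalize(t) for t in text_embeds]
    return [[sum(a * b for a, b in zip(p, t)) for t in txt] for p in pts]


def blend(visual, textual, weights):
    """Per-point convex combination: w * textual + (1 - w) * visual.

    Assumption: this is an illustrative stand-in for the weighting that
    Figure 5 visualizes, where a larger w favors the textual guidance.
    """
    return [
        [w * g + (1.0 - w) * c for g, c in zip(g_row, c_row)]
        for w, g_row, c_row in zip(weights, textual, visual)
    ]


# Toy data: two query points, two classes (background, target), 2-D features.
F_q = [[1.0, 0.0], [0.0, 2.0]]   # stand-in for intermodal query features
T = [[1.0, 0.0], [0.0, 1.0]]     # stand-in for text embeddings
G_q = affinity(F_q, T)           # textual guidance per point and class
C = [[0.2, 0.8], [0.6, 0.4]]     # stand-in for visual correlations
W_q = [1.0, 0.0]                 # point 0 trusts text, point 1 trusts vision
fused = blend(C, G_q, W_q)
```

With these toy weights, the first point takes its scores entirely from the textual guidance and the second entirely from the visual correlations, which mirrors how the heatmap in Figure 5 is meant to be read.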