
PEVA-Net: Prompt-Enhanced View Aggregation Network for Zero/Few-Shot Multi-View 3D Shape Recognition

Dongyun Lin, Yi Cheng, Shangbo Mao, Aiyuan Guo, Yiqun Li

TL;DR

PEVA-Net introduces a CLIP-based, prompt-enhanced framework for zero-/few-shot multi-view 3D shape recognition. It combines a prompt-guided view-aggregation module to form a discriminative zero-shot descriptor and a ViT-based encoder with a self-distillation loss to align few-shot descriptors to the zero-shot reference. The approach achieves state-of-the-art zero-shot performance on ModelNet40/ModelNet10 and ShapeNetCore 55, and strong few-shot results (e.g., 16-shot 90.64% on ModelNet40) without 3D pretraining. This work demonstrates that prompt design and cross-modal guidance can significantly reduce data requirements while maintaining high recognition accuracy in 3D vision tasks.

Abstract

Large vision-language models have substantially improved the performance of 2D visual recognition under zero/few-shot scenarios. In this paper, we focus on exploiting a large vision-language model, i.e., CLIP, to address zero/few-shot 3D shape recognition based on multi-view representations. The key challenge for both tasks is to generate a discriminative descriptor of a 3D shape represented by multiple view images, either without explicit training (zero-shot 3D shape recognition) or with training on a limited amount of data (few-shot 3D shape recognition). We observe that the two tasks are related and can be addressed together. Specifically, using the descriptor that is effective for zero-shot inference to guide the tuning of the aggregated descriptor during few-shot training can significantly improve few-shot learning efficacy. Hence, we propose the Prompt-Enhanced View Aggregation Network (PEVA-Net) to address zero/few-shot 3D shape recognition simultaneously. Under the zero-shot scenario, we leverage prompts built from the candidate categories to enhance the aggregation of multiple view-associated visual features. The resulting aggregated feature enables effective zero-shot recognition of 3D shapes. Under the few-shot scenario, we first use a transformer encoder to aggregate the view-associated visual features into a global descriptor. To tune the encoder, we propose a self-distillation scheme that complements the main classification loss with a feature distillation loss, treating the zero-shot descriptor as the guidance signal for the few-shot descriptor. This scheme can significantly enhance few-shot learning efficacy.
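
The abstract describes the two components only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' formulation. It assumes that each view is weighted by its best cosine similarity to any category prompt (a softmax with a hypothetical `temperature`), and that the few-shot objective is cross-entropy plus a mean-squared-error feature distillation term scaled by a hypothetical weight `lam`. All function and parameter names are illustrative, and plain Python lists stand in for CLIP image and text features.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v, eps=1e-12):
    norm = math.sqrt(dot(v, v)) + eps
    return [x / norm for x in v]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prompt_weighted_aggregate(view_feats, prompt_feats, temperature=0.01):
    """Aggregate per-view features into one descriptor.

    ASSUMPTION: a view's weight comes from its highest cosine similarity
    to any prompt feature; the paper's actual weighting may differ.
    """
    views = [l2_normalize(v) for v in view_feats]
    prompts = [l2_normalize(p) for p in prompt_feats]
    scores = [max(dot(v, p) for p in prompts) for v in views]
    weights = softmax([s / temperature for s in scores])
    dim = len(views[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, views)) for d in range(dim)]
    return l2_normalize(pooled), weights

def zero_shot_predict(descriptor, prompt_feats):
    """Return the index of the prompt most similar to the descriptor."""
    sims = [dot(descriptor, l2_normalize(p)) for p in prompt_feats]
    return max(range(len(sims)), key=sims.__getitem__)

def few_shot_loss(logits, label, few_shot_desc, zero_shot_desc, lam=1.0):
    """Cross-entropy plus a feature distillation term.

    ASSUMPTION: distillation is the mean squared error between the
    L2-normalized few-shot and zero-shot descriptors, weighted by lam.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    ce = log_z - logits[label]
    f = l2_normalize(few_shot_desc)
    z = l2_normalize(zero_shot_desc)
    distill = sum((a - b) ** 2 for a, b in zip(f, z)) / len(f)
    return ce + lam * distill, ce, distill

if __name__ == "__main__":
    # Toy 3-D features: three views, two category prompts.
    views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
    prompts = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    descriptor, weights = prompt_weighted_aggregate(views, prompts)
    print("view weights:", weights)
    print("predicted class:", zero_shot_predict(descriptor, prompts))
```

In this sketch the zero-shot descriptor is treated as a fixed target. In a real training loop it would typically be computed without gradient flow, so that only the few-shot encoder is updated.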

Paper Structure

This paper contains 24 sections, 12 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison between (a) the trivial view aggregation scheme via pooling; (b) our proposed prompt-enhanced view aggregation by leveraging the prompt-associated and the view-associated features.
  • Figure 2: Empirical observations showing the effectiveness of the proposed self-distillation scheme: (a) recognition accuracy on the ModelNet40 test set across training epochs under the 16-shot setting; (b) and (c) 2D t-SNE embeddings of test samples from 10 categories, produced by PEVA-Net without and with feature distillation, respectively.
  • Figure 3: The overall architecture of the proposed PEVA-Net.
  • Figure 4: The architecture of the ViT encoder.
  • Figure 5: Few-shot recognition performance on ModelNet40 under different numbers of shots.
  • ...and 2 more figures