KNN Transformer with Pyramid Prompts for Few-Shot Learning

Wenhao Li, Qiangchang Wang, Peng Zhao, Yilong Yin

TL;DR

KTPP, a K-NN Transformer with Pyramid Prompts, selects discriminative information with K-NN Context Attention (KCA) and adaptively modulates visual features with Pyramid Cross-modal Prompts (PCP). This lets the ViT dynamically adjust the importance weights of visual features based on rich semantic information at different scales, making the model robust to spatial variations.

Abstract

Few-Shot Learning (FSL) aims to recognize new classes with limited labeled data. Recent studies have attempted to address the scarcity of samples by using textual prompts to modulate visual features. However, they usually struggle to capture complex semantic relationships between textual and visual features. Moreover, vanilla self-attention is heavily affected by irrelevant information in images, and the many irrelevant tokens involved in the interaction severely constrain the potential of semantic priors in FSL. To address these issues, a K-NN Transformer with Pyramid Prompts (KTPP) is proposed to select discriminative information with K-NN Context Attention (KCA) and adaptively modulate visual features with Pyramid Cross-modal Prompts (PCP). First, for each token, the KCA selects only the K most relevant tokens to compute the self-attention matrix and incorporates the mean of all tokens as the context prompt to provide global context in three cascaded stages. As a result, irrelevant tokens are progressively suppressed. Second, pyramid prompts are introduced in the PCP to emphasize visual features via interactions between text-based class-aware prompts and multi-scale visual features. This allows the ViT to dynamically adjust the importance weights of visual features based on rich semantic information at different scales, making models robust to spatial variations. Finally, the augmented visual features and class-aware prompts interact via the KCA to extract class-specific features. Consequently, our model further enhances noise-free visual representations via deep cross-modal interactions, extracting generalized visual representations in scenarios with few labeled samples. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our method.
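As a rough illustration of the KCA step described in the abstract, the NumPy sketch below keeps only the K highest query-key scores per token and sets the rest to negative infinity before the softmax. Several details are assumptions made for illustration and are not taken from the paper: the function name `knn_context_attention` is hypothetical, identity projections stand in for learned query/key/value weights, there is a single head, and the context prompt (the mean of all tokens) is simply prepended as one extra token.

```python
import numpy as np

def knn_context_attention(x, top_k):
    """Top-k masked self-attention over tokens x of shape (N, d).

    Returns (output, weights), each with N + 1 rows: row 0 is the
    context prompt, rows 1..N are the original tokens.
    """
    n, d = x.shape
    assert 1 <= top_k <= n + 1, "top_k must not exceed the number of tokens"

    # Context prompt: mean of all tokens, prepended as an extra token
    # (assumption: the paper may inject it differently).
    context = x.mean(axis=0, keepdims=True)
    tokens = np.concatenate([context, x], axis=0)        # (N + 1, d)

    # Identity projections stand in for learned W_q, W_k, W_v (assumption).
    q, k, v = tokens, tokens, tokens
    scores = q @ k.T / np.sqrt(d)                         # (N + 1, N + 1)

    # Keep the top_k largest scores in each row; mask the rest to -inf.
    # Exact ties at the threshold would keep more than top_k entries.
    kth_largest = np.sort(scores, axis=1)[:, -top_k][:, None]
    masked = np.where(scores >= kth_largest, scores, -np.inf)

    # Row-wise softmax; masked entries receive exactly zero weight.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))                               # 6 tokens, dimension 4
out, w = knn_context_attention(x, top_k=3)
print(out.shape, (w > 0).sum(axis=1))
```

Stacking three such calls, each followed by a projection that changes the token resolution, would loosely mirror the three cascaded stages mentioned above, although the actual stage design is not specified here.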

Paper Structure

This paper contains 23 sections, 13 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: The image annotated as "dog" contains abundant spurious information such as people and walls. Moreover, object scale varies across images, which limits the performance of ViTs in FSL. Our KTPP achieves coarse-to-fine noise filtering, adaptation to spatial variations, and prompt-guided class-specific visual feature extraction.
  • Figure 2: The framework of our proposed KTPP. Image patches are sequentially fed into three cascaded stages, where the KCA filters out irrelevant tokens and a linear projection extracts multi-scale support features $Z_v^S$. Text-based class-aware prompts $Z_t^{cp}$ are obtained via CLIP. Pyramid prompts $Z_t^{pp}$ are obtained to enhance support features via the PCP between $Z_t^{cp}$ and $Z_v^S$. The enhanced support features and $Z_t^{cp}$ then interact via the KCA to extract class-specific support features.
  • Figure 3: Illustration of K-NN Context Attention (KCA). For each query, the $k$ highest-scoring of the $N$ query-key pairs are selected for computing attention weights, and the remaining scores are set to negative infinity. The context prompt is the mean of all tokens weighted by the KCA.
  • Figure 4: Illustration of the cross-modal enhancement module in the PCP. Class-aware prompts $Z_t^{cp}$ provide the queries $Q$, dynamically adjusting the importance weights of visual features.
  • Figure 5: t-SNE visualization results on novel classes from three benchmark datasets. ViT with KTPP performs better than ViT with the baseline.
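
The captions for Figures 2 and 4 describe class-aware prompts acting as queries over multi-scale visual features. The sketch below is one possible reading of that idea, not the authors' implementation: the function name `pyramid_prompts` is hypothetical, projections are omitted, and the prompt and visual features are assumed to share the same dimension. For each scale, the prompt scores every visual token, the softmax of those scores serves as the importance weights, and the weighted sum of tokens gives a per-scale prompt.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(a - a.max())
    return e / e.sum()

def pyramid_prompts(class_prompt, multi_scale_feats):
    """Cross-attention with the text prompt as the single query.

    class_prompt:      (d,) text-based class-aware prompt.
    multi_scale_feats: list of (N_s, d) visual token arrays, one per scale.
    Returns per-scale prompts (each (d,)) and importance weights (each (N_s,)).
    """
    d = class_prompt.shape[0]
    prompts, weights = [], []
    for feats in multi_scale_feats:
        # Importance of each visual token with respect to the class prompt.
        w = softmax(feats @ class_prompt / np.sqrt(d))
        # Per-scale prompt: visual tokens aggregated by those weights.
        prompts.append(w @ feats)
        weights.append(w)
    return prompts, weights

rng = np.random.default_rng(0)
prompt = rng.normal(size=8)
# Three scales with decreasing token counts (token counts are illustrative).
feats = [rng.normal(size=(n, 8)) for n in (16, 8, 4)]
pp, ws = pyramid_prompts(prompt, feats)
print([p.shape for p in pp], [w.shape for w in ws])
```

How the resulting prompts are fed back to enhance the support features is not stated in the captions, so that step is deliberately left out of the sketch.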