PrototypeFormer: Learning to Explore Prototype Relationships for Few-shot Image Classification

Meijuan Su, Feihong He, Fanzhang Li

TL;DR

This work addresses the challenge of few-shot image classification by learning relationships among class prototypes. It introduces PrototypeFormer, combining a transformer-based Prototype Extraction Module with a prototype contrastive loss, built on a frozen CLIP backbone to produce discriminative prototypes and sub-prototypes in $N$-way $K$-shot episodes. The approach achieves state-of-the-art results on miniImageNet (e.g., 97.07% for 5-way 5-shot) and demonstrates strong gains on tieredImageNet and CUB-200, validating the effectiveness of prototype-relational modeling. The method offers a simple yet powerful way to enhance prototype representations and generalize across diverse datasets, with clear visualization evidence of improved class separability.

Abstract

Few-shot image classification has received considerable attention as a way to address poor classification performance when novel classes have only a few labeled samples. Most existing works employ sophisticated learning strategies and feature learning modules to alleviate this challenge. In this paper, we propose a novel method called PrototypeFormer, which explores the relationships among category prototypes in the few-shot scenario. Specifically, we use a transformer architecture to build a prototype extraction module that extracts class representations that are more discriminative for few-shot classification. In addition, we propose a contrastive learning-based optimization approach to optimize prototype features during training. Despite its simplicity, our method performs remarkably well, with no bells and whistles. Experiments on several popular few-shot image classification benchmark datasets show that our method outperforms all current state-of-the-art methods. In particular, our method achieves 97.07% and 90.88% accuracy on the 5-way 5-shot and 5-way 1-shot tasks of miniImageNet, surpassing the previous state-of-the-art results by 0.57% and 6.84%, respectively. The code will be released later.
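
To make the "5-way 5-shot" and "5-way 1-shot" task terminology concrete, the following is a minimal sketch of how one $N$-way $K$-shot episode could be drawn from a labeled dataset. It is an illustration only, not the authors' data pipeline: the function name `sample_episode`, the toy label list, and the choice of 15 query samples per class are all assumptions.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode (hypothetical helper, not from the paper).

    Returns (support, query): lists of (sample_index, episode_label) pairs,
    where episode_label is in range(n_way).
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Only classes with enough samples for both support and query are eligible.
    eligible = sorted(c for c, idxs in by_class.items()
                      if len(idxs) >= k_shot + n_query)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query samples")

    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picked = rng.sample(by_class[cls], k_shot + n_query)
        # The first k_shot indices form the support set, the rest the query set.
        support += [(i, episode_label) for i in picked[:k_shot]]
        query += [(i, episode_label) for i in picked[k_shot:]]
    return support, query


# Toy dataset: 10 classes with 30 samples each (labels only, no images).
labels = [c for c in range(10) for _ in range(30)]
support, query = sample_episode(labels, n_way=5, k_shot=5, n_query=15,
                                rng=random.Random(0))
print(len(support), len(query))  # 25 75
```

Setting `k_shot=1` in the same call gives a 5-way 1-shot episode, in which each class is represented by a single support image.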

Paper Structure (17 sections, 8 equations, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: Samples from different categories exhibit both shared and distinctive features. For example, the red rectangle marks features that are similar across categories, while the purple rectangle marks features that differ across categories.
  • Figure 2: Overall flowchart of the proposed method. We linearly combine the support set and obtain sub-prototypes through the prototype extraction module. The sub-prototypes are used to compute the prototype contrastive loss $L_{prototype}$, while the prototype is used to compute the classification loss $L_{classifier}$. We sum $L_{prototype}$ and $L_{classifier}$ to obtain the final optimization objective.
  • Figure 3: The prototype extraction module adopts the Transformer architecture, taking the prototype token and the embeddings of same-class images from the support set as inputs to obtain the prototype and sub-prototype for that class.
  • Figure 4: We randomly select eight task sets from the test dataset and visualize their feature embeddings using t-SNE. In the visualization, circles represent query samples, triangles represent prototypes obtained by averaging the support set, and pentagrams represent class feature embeddings obtained by our proposed method.
  • Figure 5: We randomly choose 5 categories from the test set, with 15 samples in each category, and create their similarity matrix. In the visualization, yellow areas show correct classifications, while blue areas indicate misclassifications.
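
For reference, the averaged prototypes that Figure 4 draws as triangles can be sketched as follows. This is only the mean-prototype baseline, not the PrototypeFormer extraction module, and it is a sketch under stated assumptions: the helper names, the toy 2-dimensional embeddings, and the use of cosine similarity for assigning a query to a prototype are all assumptions, since the captions above do not specify a similarity measure.

```python
import math


def mean_prototypes(support_embeddings, support_labels, n_way):
    """Average the support embeddings of each class (baseline prototypes)."""
    dim = len(support_embeddings[0])
    sums = [[0.0] * dim for _ in range(n_way)]
    counts = [0] * n_way
    for emb, label in zip(support_embeddings, support_labels):
        counts[label] += 1
        for d, value in enumerate(emb):
            sums[label][d] += value
    return [[s / counts[c] for s in sums[c]] for c in range(n_way)]


def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify(query_embedding, prototypes):
    """Return the index of the prototype most similar to the query."""
    sims = [cosine(query_embedding, p) for p in prototypes]
    return max(range(len(prototypes)), key=sims.__getitem__)


# Toy 2-way 2-shot episode with hand-written 2-D embeddings.
support_embeddings = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
support_labels = [0, 0, 1, 1]
prototypes = mean_prototypes(support_embeddings, support_labels, n_way=2)
print(classify([0.9, 0.3], prototypes))  # 0
```

In a 1-shot episode the mean prototype is simply the single support embedding.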