Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning

Weihao Jiang, Chang Liu, Kun He

TL;DR

IMAformer addresses few-shot classification by leveraging a Vision Transformer backbone with intra-task mutual attention, where patch tokens are exchanged between support and query within a task to reinforce same-class features. The model is pre-trained with self-supervised Masked Image Modeling to obtain meaningful representations and only requires fine-tuning of a small set of layers around the mutual-attention mechanism and CLS modules. Empirical results across five benchmarks show state-of-the-art performance in both 5-shot and 1-shot settings, with competitive efficiency and no reliance on external modules. The approach demonstrates that jointly attending to global and local information within a task can significantly enhance discriminability in few-shot scenarios, with strong generalization across datasets of varying resolution and granularity.

Abstract

Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples. This ability stems from their capacity to identify common features shared between new and previously seen images while disregarding distractions such as background variations. For artificial neural network models, however, determining the most relevant features for distinguishing between two images from limited samples remains a challenge. In this paper, we propose an intra-task mutual attention method for few-shot learning, which splits the support and query samples into patches and encodes them using a pre-trained Vision Transformer (ViT) architecture. Specifically, we swap the class (CLS) token and patch tokens between the support and query sets to realize mutual attention, which enables each set to focus on the most useful information. This strengthens intra-class representations and brings instances of the same class closer together. For implementation, we adopt a ViT-based network architecture and use pre-trained model parameters obtained through self-supervision. By using Masked Image Modeling as the self-supervised pre-training task, the pre-trained model yields semantically meaningful representations while avoiding supervision collapse. We then employ a meta-learning method to fine-tune the last several layers and the CLS token modules. Our strategy significantly reduces the number of parameters that require fine-tuning while effectively utilizing the capability of the pre-trained model. Extensive experiments show that our framework is simple, effective and computationally efficient, achieving superior performance compared to state-of-the-art baselines on five popular few-shot classification benchmarks under the 5-shot and 1-shot scenarios.
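The token exchange described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it handles a single support/query pair, and the function names `attend_cls` and `mutual_attention` are hypothetical. The last Transformer layer is replaced by plain single-head dot-product attention with no learned projections, MLP, or layer normalization, so it only shows where information flows after the swap.

```python
import math


def attend_cls(cls_tok, patch_toks):
    """Single-head dot-product attention with the CLS token as the only query.

    Assumption: a stand-in for the last Transformer layer, without learned
    projections, MLP, or layer norm.
    """
    tokens = [cls_tok] + patch_toks
    dim = len(cls_tok)
    # Scaled dot-product score between the CLS token and every token.
    scores = [sum(q * k for q, k in zip(cls_tok, t)) / math.sqrt(dim) for t in tokens]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    # The new CLS token is the attention-weighted average of all tokens.
    return [sum(w / total * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]


def mutual_attention(sup_cls, sup_patches, qry_cls, qry_patches):
    """Pair each CLS token with the other image's patch tokens, then attend."""
    new_sup_cls = attend_cls(sup_cls, qry_patches)  # support CLS reads query patches
    new_qry_cls = attend_cls(qry_cls, sup_patches)  # query CLS reads support patches
    return new_sup_cls, new_qry_cls


# Toy 2-dimensional tokens standing in for the outputs of the first L-1 layers.
sup_cls, sup_patches = [1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]]
qry_cls, qry_patches = [0.0, 1.0], [[2.0, 0.0], [0.0, 2.0]]
new_sup_cls, new_qry_cls = mutual_attention(sup_cls, sup_patches, qry_cls, qry_patches)
print(new_sup_cls, new_qry_cls)
```

In this sketch each final CLS token is a weighted average over its own previous value and the other image's patches. When support and query share a class, both CLS tokens are therefore drawn toward the features the two images have in common.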

Paper Structure (16 sections, 8 equations, 4 figures, 6 tables)

This paper contains 16 sections, 8 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The main framework of our IMAformer model. We adopt a Transformer architecture in which the first $L-1$ layers encode the patched input images containing support and query. After obtaining the CLS token and patch tokens, we exchange the patch tokens between support and query, and the newly combined tokens are sent to the last layer to get the final CLS token for the inputs.
  • Figure 2: Illustration of the IMAformer processing pipeline for a 5-way 1-shot task. The support and query images are first split into patches, and the first $L-1$ layers of IMAformer then encode the input images. After the CLS token and patch tokens are obtained, the patch tokens are exchanged between support and query, and the new combination of tokens is sent to the last layer to get the final CLS token of the input images. The CLS tokens of the same class in support and query are strengthened, as marked by the dotted box.
  • Figure 3: Comparison of 5-way 5-shot performance on $mini$ImageNet when different layers of the model are fine-tuned. ViT-Small is the backbone. 'w/o CLS' denotes fine-tuning excluding CLS and 'w/ CLS' denotes fine-tuning including CLS.
  • Figure 4: Qualitative visualization of model-based embeddings before and after applying the intra-task mutual attention method on test tasks. Each figure shows the locations of PCA-projected query embeddings (a) before and (b) after the adaptation of IMAformer. Values below are for the 5-way 15-shot few-shot task before and after the adaptation. The embedding adaptation step of IMAformer pushes the query embeddings toward their own clusters, so that they better fit the test data of their categories.
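
The fine-tuning settings compared in Figure 3 (how many of the last layers are tuned, with or without CLS) can be sketched as a parameter-selection step. This is an illustration under stated assumptions, not the authors' code: the function `select_trainable` is hypothetical, as are the parameter naming scheme (`blocks.<i>.…`, `cls_token`) and the depth of 12 layers used in the example.

```python
def select_trainable(param_names, num_layers, last_k, tune_cls):
    """Return the parameter names to fine-tune; all others stay frozen.

    Assumes hypothetical names of the form "blocks.<i>.<...>" for layer i
    (0-indexed) and "cls_token" for the CLS token.
    """
    first_tuned = num_layers - last_k  # index of the first layer to fine-tune
    selected = []
    for name in param_names:
        parts = name.split(".")
        in_last_layers = (
            len(parts) > 1
            and parts[0] == "blocks"
            and parts[1].isdigit()
            and int(parts[1]) >= first_tuned
        )
        is_cls = tune_cls and name == "cls_token"
        if in_last_layers or is_cls:
            selected.append(name)
    return selected


# Example parameter list for an assumed 12-layer backbone.
names = ["cls_token", "patch_embed.weight"] + [
    f"blocks.{i}.attn.qkv.weight" for i in range(12)
]
print(select_trainable(names, num_layers=12, last_k=2, tune_cls=True))
# prints ['cls_token', 'blocks.10.attn.qkv.weight', 'blocks.11.attn.qkv.weight']
```

Setting `tune_cls=False` corresponds to the 'w/o CLS' setting, and varying `last_k` corresponds to changing the number of fine-tuned layers.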