Siamese Transformer Networks for Few-shot Image Classification

Weihao Jiang, Shuoxi Zhang, Kun He

TL;DR

The paper tackles few-shot image classification by integrating global and local image cues through a Siamese Transformer Network (STN) built on two Vision Transformer branches. Global similarity is measured with the squared Euclidean distance $D_{Ed}$ on class embeddings, while local similarity uses the asymmetric KL divergence $D_{KL}$ on patch embeddings. The two scores are $L_2$-normalized and combined with a fixed fusion weight $\alpha$ to produce the final similarity used for nearest-neighbor classification. The method is trained with a metric-based, episodic meta-learning regime and does not rely on additional feature-adaptation modules, which the authors argue improves generalization. Empirical results on miniImageNet, tieredImageNet, CIFAR-FS, and FC100 show STN achieving competitive or state-of-the-art performance in both 1-shot and 5-shot scenarios, supporting the value of jointly using global and local features in few-shot tasks.

Abstract

Humans exhibit remarkable proficiency in visual classification tasks, accurately recognizing and classifying new images with minimal examples. This ability is attributed to their capacity to focus on details and identify common features between previously seen and new images. In contrast, existing few-shot image classification methods often emphasize either global features or local features, with few studies considering the integration of both. To address this limitation, we propose a novel approach based on the Siamese Transformer Network (STN). Our method employs two parallel branch networks utilizing the pre-trained Vision Transformer (ViT) architecture to extract global and local features, respectively. Specifically, we implement the ViT-Small network architecture and initialize the branch networks with pre-trained model parameters obtained through self-supervised learning. We apply the Euclidean distance measure to the global features and the Kullback-Leibler (KL) divergence measure to the local features. To integrate the two metrics, we first employ L2 normalization and then weight the normalized results to obtain the final similarity score. This strategy leverages the advantages of both global and local features while ensuring their complementary benefits. During the training phase, we adopt a meta-learning approach to fine-tune the entire network. Our strategy effectively harnesses the potential of global and local features in few-shot image classification, circumventing the need for complex feature adaptation modules and enhancing the model's generalization ability. Extensive experiments demonstrate that our framework is simple yet effective, achieving superior performance compared to state-of-the-art baselines on four popular few-shot classification benchmarks in both 5-shot and 1-shot scenarios.
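
The scoring step described above (a Euclidean measure on global features, a KL measure on local features, L2 normalization, then a weighted sum) can be illustrated with a small standard-library sketch. It is not the authors' implementation. The following are assumptions made only for illustration: each class is represented by one class embedding and one list of patch embeddings; patch embeddings are converted to distributions with a softmax; patches are compared position by position and the KL values averaged; normalization is applied across the candidate classes; and the fusion is a convex combination with weight `alpha`. All function names and the toy data are hypothetical.

```python
import math

def log_softmax(v):
    """Numerically stable log-softmax of a list of floats."""
    m = max(v)
    log_z = m + math.log(sum(math.exp(x - m) for x in v))
    return [x - log_z for x in v]

def squared_euclidean(a, b):
    """Squared Euclidean distance between two class embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def patch_kl(query_patches, support_patches):
    """Mean KL(query || support) over position-aligned patches (assumed pairing)."""
    total = 0.0
    for qp, sp in zip(query_patches, support_patches):
        lp, lq = log_softmax(qp), log_softmax(sp)
        total += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return total / len(query_patches)

def l2_normalize(scores):
    """Scale a list of scores to unit L2 norm (left unchanged if all zero)."""
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm for s in scores] if norm > 0 else list(scores)

def fused_distances(query, classes, alpha=0.5):
    """Fuse normalized global and local distances; alpha weights the global term."""
    d_ed = l2_normalize([squared_euclidean(query["cls"], c["cls"]) for c in classes])
    d_kl = l2_normalize([patch_kl(query["patches"], c["patches"]) for c in classes])
    return [alpha * e + (1 - alpha) * k for e, k in zip(d_ed, d_kl)]

def classify(query, classes, alpha=0.5):
    """Nearest-neighbor decision: index of the class with the smallest fused distance."""
    fused = fused_distances(query, classes, alpha)
    return min(range(len(fused)), key=fused.__getitem__)

# Toy 2-way example with 2-dimensional embeddings and 2 patches per image.
classes = [
    {"cls": [1.0, 0.0], "patches": [[2.0, 0.0], [1.0, 0.0]]},
    {"cls": [0.0, 1.0], "patches": [[0.0, 2.0], [0.0, 1.0]]},
]
query = {"cls": [0.9, 0.1], "patches": [[1.8, 0.1], [0.9, 0.0]]}
print(classify(query, classes))  # prints 0: the query is closest to the first class
```

Normalizing each distance vector before fusion keeps either measure from dominating purely because of its scale, which is why a single weight can balance the two terms in this sketch.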

Paper Structure (18 sections, 14 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Illustration of the limitations of methods based on either global features or local features alone. (a) When two images exhibit high similarity, global feature-based methods are susceptible to misjudgment. (b) Two images containing numerous similar details pose a challenge for local feature-based methods.
  • Figure 2: The processing pipeline of STN involves several key steps. Support and query images are first divided into patches and then encoded using a pre-trained Vision Transformer (ViT). Class embeddings capture the global features, while patch embeddings represent the local features. Using these two types of features, Euclidean distance and KL divergence are employed for measurement. Independent branch network optimization is performed based on their respective evaluation results. Finally, the scores from both measurements are normalized and weighted to generate the final similarity scores.
  • Figure 3: Results of different weights in the fusion strategy for few-shot classification on $mini$ImageNet.
  • Figure 4: Attention map visualization. The color intensity indicates the level of correlation between a local region and global information. Darker red shades signify higher correlation, while darker blue shades indicate lower correlation. By fusing global and local information, our proposed method reduces the weights assigned to semantics that are irrelevant to the global context.