Class-relevant Patch Embedding Selection for Few-Shot Image Classification

Weihao Jiang, Haoyang Cui, Kun He

TL;DR

This work tackles foreground-background interference in few-shot image classification by introducing Class-relevant Patch Embedding Selection (CPES). CPES uses a pre-trained Vision Transformer with Masked Image Modeling to extract a global class embedding and local patch embeddings, then selects class-relevant patches via their cosine similarity to the class embedding and fuses them with the class embedding for robust image representations. It computes a dense patch-wise similarity matrix between support and query representations and scores it with an MLP, all without adding extra learnable parameters for patch weighting. Extensive experiments on four standard benchmarks show CPES achieves strong performance, often surpassing state-of-the-art baselines, and extensions to existing methods confirm its flexibility and practical impact for few-shot learning. By focusing on semantically relevant patches and leveraging self-supervised ViT pretraining, the approach offers a simple yet effective way to mitigate background interference, improve generalization to new classes, and reduce model complexity in few-shot scenarios.

Abstract

Effective image classification hinges on discerning relevant features from both foreground and background elements, with the foreground typically holding the critical information. While humans adeptly classify images with limited exposure, artificial neural networks often struggle with feature selection from rare samples. To address this challenge, we propose a novel method for selecting class-relevant patch embeddings. Our approach splits support and query images into patches and encodes them with a pre-trained Vision Transformer (ViT) to obtain a class embedding and a set of patch embeddings for each image. We then filter the patch embeddings using the class embedding to retain only the class-relevant ones: for each image, we calculate the similarity between the class embedding and each patch embedding, sort the similarities in descending order, and retain only the top-ranked patch embeddings. The selected patch embeddings are fused with the class embedding to form a comprehensive image representation, enhancing pattern recognition across instances. Our strategy effectively mitigates the impact of class-irrelevant patch embeddings, yielding improved performance in pre-trained models. Extensive experiments on popular few-shot classification benchmarks demonstrate the simplicity, efficacy, and computational efficiency of our approach, which outperforms state-of-the-art baselines under both 5-shot and 1-shot scenarios.
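
The selection step described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy 2-dimensional embeddings, and the value of `k` are all hypothetical, and the fusion rule (adding a scaled class embedding to each kept patch) is an assumed stand-in, since the abstract does not specify how fusion is performed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def select_class_relevant(class_emb, patch_embs, k):
    """Return the indices of the k patches most similar to the class embedding."""
    sims = [cosine(class_emb, p) for p in patch_embs]
    # Sort patch indices by similarity, highest first, and keep the top k.
    ranked = sorted(range(len(patch_embs)), key=lambda i: sims[i], reverse=True)
    return ranked[:k]

def fuse(class_emb, patch_embs, keep, weight=1.0):
    """Assumed fusion rule: add the scaled class embedding to each kept patch."""
    return [[p + weight * c for p, c in zip(patch_embs[i], class_emb)] for i in keep]

# Toy data: one class embedding and four patch embeddings (hypothetical values).
class_emb = [1.0, 0.0]
patch_embs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]

keep = select_class_relevant(class_emb, patch_embs, k=2)
fused = fuse(class_emb, patch_embs, keep)
print(keep)
print(fused)
```

In this toy case the two patches pointing roughly in the same direction as the class embedding are kept, and the orthogonal and opposite patches are dropped. Note that the ranking itself uses only a sort over similarities, which is consistent with the TL;DR's claim that patch weighting adds no learnable parameters.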

Paper Structure (18 sections, 14 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Illustration of the patch regions. The highlighted regions contain key semantics consistent with the global information, which corresponds to the semantics of the image label, while the low-transparency regions contain semantics not relevant to the global information.
  • Figure 2: The processing pipeline of CPES. Support and query images are split into patches and then encoded with a pre-trained ViT. Patch embeddings are compared with the class embedding to select the most relevant patches, which are then fused with the class embedding to create new embeddings. A similarity matrix is calculated from these embeddings, then flattened and fed into a multi-layer perceptron to generate the similarity score.
  • Figure 3: Visualization of class-relevant patch embedding selection for four randomly sampled 5-way 1-shot classification tasks. (a), (b), (c), and (d) show the visualizations without CPES. (e), (f), (g), and (h) show the corresponding visualizations with CPES. CPES selects class-relevant patch embeddings using the class embedding, thus eliminating class-irrelevant patch embeddings.
  • Figure 4: Visualization of the selected patch embeddings for two randomly sampled 5-way 1-shot classification tasks with one query image per class. The selected class-relevant patches are retained while the class-irrelevant patches are masked. One can observe that the selected patches mainly cover the focal region.
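
The last stage of the pipeline in the Figure 2 caption (similarity matrix, flatten, multi-layer perceptron) might look roughly like the sketch below. It is only an illustration: the helper names, the toy embeddings, the single hidden ReLU layer, and the untrained placeholder weights are all assumptions, since the caption does not give the MLP architecture or the similarity measure used at this stage (cosine similarity is assumed here).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def similarity_matrix(support, query):
    """Dense patch-wise similarities: rows index support embeddings, columns query embeddings."""
    return [[cosine(s, q) for q in query] for s in support]

def mlp_score(flat, w1, b1, w2, b2):
    """Assumed scorer: one hidden ReLU layer followed by a scalar linear output."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, flat)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

# Toy fused embeddings for one support image and one query image (hypothetical values).
support = [[1.0, 0.0], [0.0, 1.0]]
query = [[1.0, 0.0], [1.0, 1.0]]

matrix = similarity_matrix(support, query)
flat = [v for row in matrix for v in row]  # flatten row by row

# Untrained placeholder weights; in practice these would be learned.
w1 = [[0.5, 0.5, 0.5, 0.5], [1.0, -1.0, 1.0, -1.0]]
b1 = [0.0, 0.0]
w2 = [1.0, 1.0]
b2 = 0.0

score = mlp_score(flat, w1, b1, w2, b2)
print(matrix)
print(score)
```

With k selected embeddings per image the flattened input has k × k entries, so under this assumed design the MLP input size is fixed by the number of retained patches rather than by the image resolution.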