Class-relevant Patch Embedding Selection for Few-Shot Image Classification
Weihao Jiang, Haoyang Cui, Kun He
TL;DR
This work tackles foreground-background interference in few-shot image classification by introducing Class-relevant Patch Embedding Selection (CPES). CPES uses a Vision Transformer pre-trained with Masked Image Modeling to extract a global class embedding and local patch embeddings, selects class-relevant patches by their cosine similarity to the class embedding, and fuses them with the class embedding into a robust image representation. It then computes a dense patch-wise similarity matrix between support and query representations and scores it with an MLP, all without adding extra learnable parameters for patch weighting. Extensive experiments on four standard benchmarks show that CPES achieves strong performance, often surpassing state-of-the-art baselines, and extensions to existing methods confirm its flexibility for few-shot learning. By focusing on semantically relevant patches and leveraging self-supervised ViT pretraining, the approach offers a simple yet effective way to mitigate background interference, improve generalization to new classes, and reduce model complexity in few-shot scenarios.
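The dense patch-wise similarity matrix mentioned above can be illustrated with a minimal NumPy sketch. It assumes cosine similarity between every support row and every query row. The function name `patch_similarity_matrix` and the array shapes are hypothetical, and the MLP that scores the matrix is omitted because its architecture is not described here.

```python
import numpy as np

def patch_similarity_matrix(support_repr, query_repr, eps=1e-12):
    """Cosine similarity between every support row and every query row.

    support_repr: array of shape (n_support_patches, dim)
    query_repr:   array of shape (n_query_patches, dim)
    Returns an array of shape (n_support_patches, n_query_patches).
    """
    # L2-normalise each row so a dot product equals cosine similarity.
    s = support_repr / (np.linalg.norm(support_repr, axis=1, keepdims=True) + eps)
    q = query_repr / (np.linalg.norm(query_repr, axis=1, keepdims=True) + eps)
    return s @ q.T

# Toy example with 2-dimensional embeddings (illustrative values only).
support = np.array([[1.0, 0.0], [0.0, 1.0]])
query = np.array([[1.0, 0.0], [1.0, 1.0]])
sim = patch_similarity_matrix(support, query)
```

In this sketch, entry `sim[i, j]` compares the i-th support representation with the j-th query representation. A classifier head would consume the full matrix rather than a single pooled score.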
Abstract
Effective image classification hinges on discerning relevant features from both foreground and background elements, with the foreground typically holding the critical information. While humans adeptly classify images after limited exposure, artificial neural networks often struggle to select features from only a few samples. To address this challenge, we propose a novel method for selecting class-relevant patch embeddings. Our approach splits support and query images into patches and encodes them with a pre-trained Vision Transformer (ViT) to obtain a class embedding and a set of patch embeddings for each image. We then use the class embedding to filter the patch embeddings: for each image, we calculate the similarity between the class embedding and each patch embedding, sort the similarities in descending order, and retain only the top-ranked patch embeddings. The retained patch embeddings are fused with the class embedding to form a comprehensive image representation, enhancing pattern recognition across instances. This strategy effectively mitigates the impact of class-irrelevant patch embeddings, improving the performance of pre-trained models. Extensive experiments on popular few-shot classification benchmarks demonstrate the simplicity, efficacy, and computational efficiency of our approach, which outperforms state-of-the-art baselines under both 5-shot and 1-shot scenarios.
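The selection step described in the abstract (score each patch against the class embedding, sort in descending order, keep the top-ranked patches, then fuse) can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation. The function name `select_class_relevant_patches` and the parameter `k` are hypothetical. The abstract does not specify the fusion operation, so adding the class embedding to each retained patch is an assumption made here for concreteness.

```python
import numpy as np

def select_class_relevant_patches(class_emb, patch_embs, k, eps=1e-12):
    """Keep the k patch embeddings most similar to the class embedding.

    class_emb:  array of shape (dim,)
    patch_embs: array of shape (num_patches, dim)
    Returns (top_idx, fused), where fused has shape (k, dim).
    """
    # Cosine similarity between the class embedding and every patch embedding.
    c = class_emb / (np.linalg.norm(class_emb) + eps)
    p = patch_embs / (np.linalg.norm(patch_embs, axis=1, keepdims=True) + eps)
    sims = p @ c  # shape (num_patches,)

    # Sort similarities in descending order and retain the top-k indices.
    order = np.argsort(-sims, kind="stable")
    top_idx = order[:k]

    # ASSUMPTION: fuse by adding the class embedding to each retained patch.
    fused = patch_embs[top_idx] + class_emb
    return top_idx, fused

# Toy example: four 2-dimensional patches, keep the two most relevant.
class_emb = np.array([1.0, 0.0])
patch_embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
top_idx, fused = select_class_relevant_patches(class_emb, patch_embs, k=2)
```

In the toy example, the patches aligned with the class embedding are retained, while the orthogonal and opposite-direction patches are discarded before fusion.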
