Learning to Adapt Category Consistent Meta-Feature of CLIP for Few-Shot Classification

Jiaying Shi, Xuetong Xue, Shenghui Xu

TL;DR

CLIP-based few-shot learning methods rely heavily on high-level semantic features. This work addresses that limitation with MF-Adapter, which integrates category-consistent local representations. It constructs Meta-Feature Units (MF-Units) from multi-scale sliding windows over low-level CLIP features and trains a lightweight MF-Adapter to map meta-features to MF-Units, enabling robust knowledge transfer within categories. The final prediction combines local MF-Unit-based logits with CLIP's high-level and text-based logits, yielding state-of-the-art or competitive results across 11 datasets, with especially strong gains on fine-grained tasks. The approach demonstrates the value of exploiting local, low-level cues alongside global semantic signals to improve few-shot generalization in vision-language models, with practical impact for robust zero-shot and few-shot image classification.
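To make the multi-scale sliding-window idea concrete, here is a minimal sketch and not the authors' implementation. It assumes the low-level CLIP features form an `(H, W, C)` grid and that each window is summarised by average pooling; the function name `sliding_window_features`, the window sizes, and the pooling choice are all hypothetical.

```python
import numpy as np


def sliding_window_features(feature_map, window_sizes=(1, 2), stride=1):
    """Average-pool an (H, W, C) feature map with square windows of several sizes.

    Hypothetical helper: the pooling operator and window sizes are assumptions,
    not taken from the paper. Returns an array of shape (num_windows, C) whose
    rows are L2-normalised local descriptors.
    """
    h, w, c = feature_map.shape
    pooled = []
    for k in window_sizes:
        # Slide a k x k window over every valid position at this scale.
        for top in range(0, h - k + 1, stride):
            for left in range(0, w - k + 1, stride):
                window = feature_map[top:top + k, left:left + k, :]
                pooled.append(window.reshape(-1, c).mean(axis=0))
    pooled = np.stack(pooled)
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)


# Toy stand-in for a low-level feature grid (random values, not real CLIP output).
rng = np.random.default_rng(0)
low_level = rng.normal(size=(4, 4, 8))
local_feats = sliding_window_features(low_level, window_sizes=(1, 2))
print(local_feats.shape)  # 16 windows of size 1 plus 9 of size 2 -> (25, 8)
```

Normalising each descriptor means a later dot product between two descriptors is a cosine similarity, which is a common choice for CLIP-style features.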

Abstract

Recent CLIP-based methods have shown promising zero-shot and few-shot performance on image classification tasks. Existing approaches such as CoOp and Tip-Adapter focus only on high-level visual features that are fully aligned with textual features representing the "summary" of the image. However, the goal of few-shot learning is to classify unseen images of the same category given only a few labeled samples. In particular, in contrast to high-level representations, low-level local representations (LRs) are more consistent between seen and unseen samples. Based on this observation, we propose the Meta-Feature Adaption method (MF-Adapter), which combines the complementary strengths of LRs and high-level semantic representations. Specifically, we introduce the Meta-Feature Unit (MF-Unit), a simple yet effective local similarity metric that measures category-consistent local context in an inductive manner. We then train an MF-Adapter to map image features to MF-Units, so that intra-class knowledge generalizes adequately between unseen images and the support set. Extensive experiments show that the proposed method outperforms state-of-the-art CLIP-based few-shot classification methods and shows even stronger performance on a set of challenging visual classification tasks.
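The combination of local and high-level signals described above can be illustrated with a rough sketch. This is an assumption-laden toy, not the paper's formulation: it treats each branch as a similarity lookup against support-set keys and sums the resulting class scores. The function `fuse_logits` and the weights `alpha` and `beta` are hypothetical names introduced only for illustration.

```python
import numpy as np


def fuse_logits(local_query, local_keys, global_query, global_keys,
                text_weights, labels_onehot, alpha=1.0, beta=1.0):
    """Sum text, global, and local class scores (illustrative only).

    All feature rows are assumed to be L2-normalised, so dot products are
    cosine similarities. alpha and beta are hypothetical balancing weights.
    """
    # Zero-shot branch: similarity between the image and each class text embedding.
    text_logits = global_query @ text_weights.T
    # Global branch: similarity to support images, aggregated per class.
    global_logits = (global_query @ global_keys.T) @ labels_onehot
    # Local branch: similarity to local support descriptors, aggregated per class.
    local_logits = (local_query @ local_keys.T) @ labels_onehot
    return text_logits + alpha * global_logits + beta * local_logits


# Toy 2-class, 1-shot setup with hand-made unit vectors (not real CLIP features).
eye = np.eye(4)
text_weights = eye[:2]      # one text embedding per class
global_keys = eye[:2]       # one high-level support feature per class
local_keys = eye[2:]        # one local support descriptor per class
labels_onehot = np.eye(2)   # support sample i belongs to class i
global_query = eye[1:2]     # query matches class 1 on the global branch
local_query = eye[3:4]      # query matches class 1 on the local branch

fused = fuse_logits(local_query, local_keys, global_query, global_keys,
                    text_weights, labels_onehot)
print(fused.argmax(axis=1))  # [1]
```

In this toy all three branches agree; the interest of a fused score in practice is the case where the local branch disagrees with the global one and the weights decide which signal dominates.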

Paper Structure

This paper contains 13 sections, 5 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Overview of the Meta-Feature Adapter (MF-Adapter). On the local branch (red dashed box), the MF-Unit space is built on the support set before training, from meta-features with an inductive representation. During training, the MF-Adapter maps the meta-features to MF-Units. On the global branch, CLIP's powerful high-level and text-level knowledge is used for the final prediction.
  • Figure 2: Main results of few-shot classification on 11 datasets. MF-Adapter is competitive with most of the compared SoTA methods, and its average improvement is stable and significant across all shot settings.
  • Figure 3: Performance gain contributed by the proposed MF-Adapter, constructed from the 4-shot and 8-shot training sets, on 11 classification datasets.