Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP

Yayuan Li, Jintao Guo, Lei Qi, Wenbin Li, Yinghuan Shi

TL;DR

This work tackles the limitations of training-free CLIP-based few-shot classification by addressing two bottlenecks: anomalous image matches and uneven text prompt quality. It introduces TIMO, a mutual guidance framework with Text-Guided-Image (TGI) and Image-Guided-Text (IGT) components, enabling cross-modal interaction without training. By ensembling TGI and IGT into TIMO (and its enhanced TIMO-S variant with grid-searched hyperparameters), the approach achieves state-of-the-art results among training-free methods and even surpasses some training-required methods, while significantly reducing time costs. The findings highlight the value of explicit cross-modal guidance and robust prompt integration, with potential extensions to other multimodal problems such as image retrieval.

Abstract

Contrastive Language-Image Pretraining (CLIP) has been widely used in vision tasks. Notably, CLIP has demonstrated promising performance in few-shot learning (FSL). However, existing CLIP-based methods for training-free FSL (i.e., methods that require no additional training) mainly model the two modalities independently, which leads to two essential issues: 1) severe anomalous matches in the image modality; 2) varying quality of the generated text prompts. To address these issues, we build a mutual guidance mechanism that introduces an Image-Guided-Text (IGT) component, which rectifies the varying quality of text prompts through image representations, and a Text-Guided-Image (TGI) component, which mitigates the anomalous matches in the image modality through text representations. By integrating IGT and TGI, we adopt a perspective of Text-Image Mutual guidance Optimization, proposing TIMO. Extensive experiments show that TIMO significantly outperforms the state-of-the-art (SOTA) training-free method. Additionally, by exploring the extent of mutual guidance, we propose an enhanced variant, TIMO-S, which even surpasses the best training-required methods by 0.33% at approximately 100 times lower time cost. Our code is available at https://github.com/lyymuwu/TIMO.
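
To make the idea of image-guided prompt weighting more concrete, the following is a minimal NumPy sketch. It is not the authors' implementation (see the repository linked above for that), and it does not reproduce TGI. It assumes pre-extracted, L2-normalized CLIP features, weights each class's prompts by their mean cosine similarity to that class's few-shot images, and ensembles the resulting text prototypes with plain class-mean image prototypes. The function names, the `temperature` and `alpha` parameters, and the softmax weighting rule are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def image_guided_prompt_weights(prompt_feats, support_feats, support_labels,
                                temperature=0.1):
    """Hypothetical prompt weighting (an assumption, not the paper's formula).

    prompt_feats:   (C, P, D) unit-norm text features, P prompts per class.
    support_feats:  (N, D) unit-norm few-shot image features.
    support_labels: (N,) integer class ids in [0, C).
    Returns a (C, P) array whose rows sum to 1.
    """
    num_classes, num_prompts, _ = prompt_feats.shape
    weights = np.zeros((num_classes, num_prompts))
    for c in range(num_classes):
        class_images = support_feats[support_labels == c]          # (K, D)
        # Mean cosine similarity of each prompt to the class's images.
        score = (prompt_feats[c] @ class_images.T).mean(axis=1)    # (P,)
        z = (score - score.max()) / temperature                    # stable softmax
        exp_z = np.exp(z)
        weights[c] = exp_z / exp_z.sum()
    return weights


def classify(query_feats, prompt_feats, support_feats, support_labels,
             alpha=1.0):
    """Training-free prediction from a text branch plus an image branch."""
    num_classes = prompt_feats.shape[0]
    w = image_guided_prompt_weights(prompt_feats, support_feats, support_labels)
    # Text prototypes: weighted average of each class's prompt features.
    text_proto = l2_normalize((w[..., None] * prompt_feats).sum(axis=1))
    # Image prototypes: mean of each class's few-shot image features.
    image_proto = l2_normalize(np.stack([
        support_feats[support_labels == c].mean(axis=0)
        for c in range(num_classes)
    ]))
    # alpha balances the two branches (an illustrative hyperparameter).
    logits = query_feats @ text_proto.T + alpha * (query_feats @ image_proto.T)
    return logits.argmax(axis=1)
```

In this sketch a prompt that aligns poorly with the few-shot images of its class receives a small weight, so it contributes little to the class's text prototype; no parameters are trained at any point.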

Paper Structure

This paper contains 28 sections, 14 equations, 11 figures, 10 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Defects of independent modelling. The top of the figure illustrates how the anomalous match left over from the CLIP pretraining stage results in high similarity between images from distinct categories; the bottom shows the varying quality of the automatically generated prompts, where poor prompts can negatively impact the final performance.
  • Figure 2: Comparison of the average Top-1 accuracy of various FSL methods across 11 datasets. TIMO-S outperforms the previous best method by an average of 1.76%.
  • Figure 3: Comparison between existing CLIP-based few-shot methods and ours. (a) shows the general architecture of previous work. (b) shows how this work applies text-image mutual guidance so that the two modalities interact deeply.
  • Figure 4: Domain Generalization Performance (%) of TIMO and GDA-CLIP.
  • Figure 5: Ablation of prompt selection.
  • ...and 6 more figures