IDEA: Image Description Enhanced CLIP-Adapter

Zhipeng Ye, Feng Jiang, Qiufeng Wang, Kaizhu Huang, Jiaqi Huang

TL;DR

This paper proposes the Image Description Enhanced CLIP-Adapter (IDEA), a method for adapting CLIP to few-shot image classification tasks. It also introduces Trainable-IDEA (T-IDEA), which extends IDEA with two lightweight learnable components (a projector and a learnable latent space), further improving performance and achieving SOTA results on 11 datasets.

Abstract

CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g., zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it matches or even exceeds state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model's performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at https://github.com/FourierAI/IDEA.

Paper Structure (17 sections, 7 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Comparison of different PEFT methods for CLIP. CoOp and CoCoOp have a similar architecture. Tip-Adapter shares the same architecture as CLIP-Adapter. Different from previous works, (T)-IDEA introduces a multimodal adapter that explores the complementary relationship and semantic correlation among image-text pairs.
  • Figure 2: The architecture of IDEA and T-IDEA. Given a training set with $K$-shot and $N$-class, CLIP encodes visual and textual data to obtain $\mathbf{I}_{\text{train}}$ and $\mathbf{T}_{\text{train}}$, respectively. Then, we compute and convert the instance-level similarity into class-level similarity as few-shot knowledge. Additionally, we design a trainable projector $\mathbf{W}_{\text{proj}}$ and a learnable latent space $\mathbf{E}_{\text{bias}}$ to improve performance. Finally, we combine the few-shot knowledge with the original zero-shot knowledge to get the model logits.
  • Figure 3: PyTorch-style pseudocode for IDEA.
  • Figure 4: Pipeline for generating image descriptions.
  • Figure 5: Examples of image descriptions generated by the Llama model.
  • ...and 1 more figure
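
The data flow described in the Figure 2 caption for the training-free variant can be sketched roughly as follows. This is an illustrative NumPy sketch, not the authors' code (their pseudocode is Figure 3, which is not reproduced here). Several pieces are assumptions that the caption does not specify: the function name `idea_logits_sketch`, the hyperparameters `mix`, `alpha`, `beta`, and `scale`, the linear mixing of image-image and image-text similarities, and the exponential affinity (borrowed from cache-style adapters such as Tip-Adapter). The trainable components of T-IDEA ($\mathbf{W}_{\text{proj}}$ and $\mathbf{E}_{\text{bias}}$) are left out because the caption does not say how they enter the computation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each feature vector to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def idea_logits_sketch(f_test, i_train, t_train, labels, w_zero_shot,
                       mix=0.5, alpha=1.0, beta=5.5, scale=100.0):
    """Combine few-shot and zero-shot knowledge into logits (hypothetical form).

    f_test:      (B, D)  CLIP features of the test images
    i_train:     (NK, D) CLIP image features of the K-shot, N-class training set
    t_train:     (NK, D) CLIP text features of the training image descriptions
    labels:      (NK,)   integer class index of each training sample
    w_zero_shot: (N, D)  CLIP text features of the class prompts
    All hyperparameter names and defaults are assumptions for illustration.
    """
    f_test = l2_normalize(np.asarray(f_test, dtype=float))
    i_train = l2_normalize(np.asarray(i_train, dtype=float))
    t_train = l2_normalize(np.asarray(t_train, dtype=float))
    w_zero_shot = l2_normalize(np.asarray(w_zero_shot, dtype=float))
    labels = np.asarray(labels, dtype=int)

    num_classes = w_zero_shot.shape[0]
    one_hot = np.eye(num_classes)[labels]            # (NK, N)

    # Instance-level similarity of each test image to every training sample,
    # using both the training images and their textual descriptions.
    sim_image = f_test @ i_train.T                   # (B, NK)
    sim_text = f_test @ t_train.T                    # (B, NK)
    sim = mix * sim_image + (1.0 - mix) * sim_text   # assumed mixing rule

    # Sharpen similarities, then aggregate instances into class-level scores.
    affinity = np.exp(-beta * (1.0 - sim))           # (B, NK)
    few_shot = affinity @ one_hot                    # (B, N)

    # Original zero-shot CLIP logits.
    zero_shot = scale * (f_test @ w_zero_shot.T)     # (B, N)

    return zero_shot + alpha * few_shot


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_classes, k_shot, dim = 3, 2, 8
    labels = np.repeat(np.arange(n_classes), k_shot)
    logits = idea_logits_sketch(
        rng.normal(size=(4, dim)),
        rng.normal(size=(n_classes * k_shot, dim)),
        rng.normal(size=(n_classes * k_shot, dim)),
        labels,
        rng.normal(size=(n_classes, dim)),
    )
    print(logits.shape)
```

Because nothing here is trained, the only per-dataset work in this sketch is encoding the $N \times K$ training images and their descriptions once and caching the resulting features, which is consistent with the abstract's description of IDEA as training-free.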