CoAPT: Context Attribute words for Prompt Tuning

Gun Lee, Subin An, Sungyong Baik, Soochahn Lee

TL;DR

CoAPT addresses the challenge of aligning text and image embeddings in CLIP-based few/zero-shot classification by enriching prompt text with class attribute words as hard prompts integrated into soft-prompt tuning. The method adds a meta-network to generate input-specific feature biases, enabling adaptation of the combined image-text queries. Empirical results across multiple datasets and tasks show consistent improvements over strong baselines, with ablations highlighting the benefits of more attribute words, bias application on text embeddings, and visually grounded attribute words (e.g., GPT4-Vision). This work advances prompt-tuning by uniting hard prompts, soft prompts, and per-input adaptation to better exploit multimodal latent spaces in vision-language models.

Abstract

We propose a novel prompt tuning method called CoAPT (Context Attribute words in Prompt Tuning) for few/zero-shot image classification. The core motivation is that attributes are descriptive words with rich information about a given concept. Thus, we aim to enrich text queries of existing prompt tuning methods, improving alignment between text and image embeddings in CLIP embedding space. To do so, CoAPT integrates attribute words as additional prompts within learnable prompt tuning and can be easily incorporated into various existing prompt tuning methods. To facilitate the incorporation of attributes into text embeddings and the alignment with image embeddings, soft prompts are trained together with an additional meta-network that generates input-image-wise feature biases from the concatenated feature encodings of the image-text combined queries. Our experiments demonstrate that CoAPT leads to considerable improvements for existing baseline methods on several few/zero-shot image classification tasks, including base-to-novel generalization, cross-dataset transfer, and domain generalization. Our findings highlight the importance of combining hard and soft prompts and pave the way for future research on the interplay between text and image latent spaces in pre-trained models.
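
The role of the meta-network can be illustrated with a small sketch. This is not the authors' implementation: the two-layer ReLU MLP, the feature sizes, the random stand-in features, and every function name below (`init_meta_net`, `meta_bias`, `class_scores`) are assumptions made for illustration. The only parts taken from the abstract are that the bias is computed per input image from the concatenated image and text features, and that it shifts the text feature before scoring.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length before cosine scoring."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def linear(x, weight, bias):
    """Dense layer; `weight` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_meta_net(feat_dim, hidden_dim, rng):
    """Hypothetical meta-network: 2*feat_dim -> hidden_dim -> feat_dim (assumed shape)."""
    def rand_matrix(rows, cols):
        return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": rand_matrix(hidden_dim, 2 * feat_dim), "b1": [0.0] * hidden_dim,
        "w2": rand_matrix(feat_dim, hidden_dim), "b2": [0.0] * feat_dim,
    }


def meta_bias(image_feat, text_feat, net):
    """Bias for one (image, class text) pair, computed from the concatenated features."""
    joint = list(image_feat) + list(text_feat)
    hidden = [max(0.0, h) for h in linear(joint, net["w1"], net["b1"])]
    return linear(hidden, net["w2"], net["b2"])


def class_scores(image_feat, text_feats, net):
    """Cosine similarity between the image and each bias-shifted text feature."""
    image_unit = l2_normalize(image_feat)
    scores = []
    for text_feat in text_feats:
        bias = meta_bias(image_feat, text_feat, net)
        shifted = l2_normalize([t + b for t, b in zip(text_feat, bias)])
        scores.append(sum(i * t for i, t in zip(image_unit, shifted)))
    return scores


# Random vectors stand in for CLIP image/text encoder outputs (assumption).
rng = random.Random(0)
net = init_meta_net(feat_dim=8, hidden_dim=4, rng=rng)
image_feat = [rng.gauss(0.0, 1.0) for _ in range(8)]
text_feats = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(3)]
scores = class_scores(image_feat, text_feats, net)
predicted_class = max(range(len(scores)), key=scores.__getitem__)
```

In this sketch each class text feature receives its own image-conditioned shift, so the same prompt can move toward or away from a given image. If the second layer's weights and bias are zero, the scores reduce to plain cosine similarity, which makes the contribution of the bias easy to isolate.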

Paper Structure (11 sections, 3 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Comparative overview of CoAPT. Existing soft prompt tuning methods, such as CoOp, do not fully utilize the text encoder input. The empty slots in the text query can be enhanced by integrating additional hard prompts. For the sample classes from the OxfordPets dataset, CoAPT achieves better classification accuracy for both base and new classes.
  • Figure 2: Overview of the CoAPT method with a baseline prompt learning model. The CoAPT method consists of two steps. First, attribute words are generated using a language model, which is a one-time process. Second, during prompt learning, these words are combined with the soft prompt and class token. The resulting image and text queries are processed by a meta-network, which adds a bias term to the text queries. The combined image-text queries are then used to maximize the score for the ground-truth class.
  • Figure 3: Ablation of base-to-novel generalization over the number of context attribute words, with CoAPT integrated into PromptSRC. Plots show accuracy on base classes (a), novel classes (b), and their harmonic mean (c), averaged over 11 datasets.
  • Figure 4: Qualitative comparative analysis of attribute words from GPT4-Language and GPT4-Vision. Color denotes the change in accuracy (scaled to the range $[0,1]$) when replacing a GPT4-Language-generated word (x-axis) with one from GPT4-Vision (y-axis).
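
The harmonic mean plotted in Figure 3(c) combines base-class and novel-class accuracy into one number. The sketch below assumes the usual convention in base-to-novel evaluation, H = 2 · base · novel / (base + novel); this page does not spell out the formula, and the accuracy values used here are hypothetical, not results from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of two accuracies; returns 0.0 when both are zero."""
    total = base_acc + novel_acc
    if total == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / total


# Hypothetical accuracies (percent), chosen only to illustrate the computation.
example_h = harmonic_mean(80.0, 60.0)
```

Because the harmonic mean never exceeds the arithmetic mean and drops sharply when one value is low, a method cannot score well on it by gaining base-class accuracy at the expense of novel classes.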