Generalizable Prompt Tuning for Vision-Language Models

Qian Zhang

TL;DR

This work tackles the challenge of achieving both strong task-specific performance and broad generalization in vision-language prompting. It introduces an MI-based Ensemble and Exploration framework (E2MIM) that treats hand-crafted and learnable prompts as dual textual views and maximizes mutual information to fuse their semantic information, augmented by class-wise visual Mixup to enrich the prompt space. A learnable MI estimator guides the integration of augmented visual features with textual prompts, producing better base-to-new generalization, domain generalization, and cross-dataset transfer across 11 benchmarks, while maintaining efficient training. The approach demonstrates a favorable trade-off between task adaptability and generalization, offering practical gains for open-world recognition with VLMs.
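The class-wise visual Mixup mentioned above can be illustrated with a minimal sketch. It assumes the standard Mixup recipe (a convex combination of two feature vectors and of their one-hot labels, with the mixing weight drawn from a Beta distribution); the summary does not spell out the exact variant used in the paper, and the names `mixup_features` and `sample_lambda` are hypothetical.

```python
import random


def sample_lambda(alpha=1.0, rng=random):
    """Draw a mixing weight in [0, 1] from Beta(alpha, alpha) (assumed choice)."""
    return rng.betavariate(alpha, alpha)


def mixup_features(feat_a, feat_b, label_a, label_b, num_classes, lam):
    """Mix two image feature vectors from different classes.

    Returns the mixed feature and a soft label that puts weight `lam`
    on class `label_a` and `1 - lam` on class `label_b`.
    """
    mixed_feat = [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]
    mixed_label = [0.0] * num_classes
    mixed_label[label_a] += lam
    mixed_label[label_b] += 1.0 - lam
    return mixed_feat, mixed_label


# Illustrative usage with toy 2-D features for "Class 1" and "Class 2".
lam = sample_lambda(alpha=1.0)
mixed, soft_label = mixup_features([1.0, 0.0], [0.0, 1.0], 0, 1, 2, lam)
```

The mixed feature plays the role of a "Mixed Class 1&2" sample: it lies between the two classes, so a prompt trained on it sees inputs beyond the original class-wise features.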

Abstract

Prompt tuning for vision-language models such as CLIP involves optimizing the text prompts used to generate image-text pairs for specific downstream tasks. While hand-crafted or template-based prompts are generally applicable to a wider range of unseen classes, they tend to perform poorly in downstream tasks (i.e., seen classes). Learnable soft prompts, on the other hand, often perform well in downstream tasks but lack generalizability. Additionally, prior research has predominantly concentrated on the textual modality, with very few studies attempting to explore the prompt's generalization potential from the visual modality. Keeping these limitations in mind, we investigate how to perform prompt tuning so as to obtain both competitive downstream performance and generalization. The study shows that by treating soft and hand-crafted prompts as dual views of the textual modality, and maximizing their mutual information, we can better ensemble task-specific and general semantic information. Moreover, to generate more expressive prompts, the study introduces a class-wise augmentation from the visual modality, resulting in significant robustness to a wider range of unseen classes. Extensive evaluations on several benchmarks show that the proposed approach achieves competitive results in terms of both task-specific performance and generalization ability.
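To make the idea of maximizing mutual information between two views concrete, the sketch below computes the InfoNCE lower bound on MI for a batch of paired vectors. This is a stand-in, not the paper's estimator: the abstract does not say which bound is used, and here a fixed cosine-similarity critic replaces the learnable one. The function names and the `temperature` default are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def infonce_lower_bound(view_a, view_b, temperature=0.1):
    """InfoNCE bound: log N + mean_i log softmax_j(score(a_i, b_j))[i].

    view_a[i] and view_b[i] are the two views of the same sample
    (for example, outputs from hand-crafted and learnable prompts).
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        scores = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates j.
        m = max(scores)
        log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
        total += scores[i] - log_denom
    return math.log(n) + total / n


# Matched views score higher than mismatched ones.
a = [[1.0, 0.0], [0.0, 1.0]]
aligned = infonce_lower_bound(a, [[1.0, 0.0], [0.0, 1.0]])
shuffled = infonce_lower_bound(a, [[0.0, 1.0], [1.0, 0.0]])
```

The bound is capped at log N, so it grows toward that cap as the two views agree on each sample and disagree across samples. Training would maximize this quantity (or minimize its negative) with respect to the learnable prompt.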

Paper Structure

This paper contains 18 sections, 14 equations, 2 figures, and 8 tables.

Figures (2)

  • Figure 1: Framework overview. For simplicity, the method is illustrated as fine-tuned on two classes. Image features from Class 1, Class 2, and Mixed Class 1&2 are first fused with the text embeddings of the hand-crafted and learnable prompts, respectively, which yields augmented dual views of the prediction probability. The learnable MI estimator is then used to ensemble the shared semantic cues from the general and task-specific information.
  • Figure 2: t-SNE visualization on Flowers102.
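The fusion step in the Figure 1 caption, where an image feature is combined with the text embeddings of each prompt type to give a prediction probability, can be sketched in the CLIP style: a softmax over temperature-scaled cosine similarities between the image feature and one text embedding per class. The function names, the toy embeddings, and the `temperature` default are assumptions for illustration; real embeddings would come from the CLIP encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def class_probabilities(image_feat, text_embeds, temperature=0.01):
    """Softmax over cosine similarities between one image and each class prompt."""
    img = normalize(image_feat)
    logits = [
        sum(a * b for a, b in zip(img, normalize(t))) / temperature
        for t in text_embeds
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def dual_view_probabilities(image_feat, handcrafted_embeds, learnable_embeds,
                            temperature=0.01):
    """Return the two prediction views for the same image feature."""
    return (
        class_probabilities(image_feat, handcrafted_embeds, temperature),
        class_probabilities(image_feat, learnable_embeds, temperature),
    )


# Toy two-class example: one image feature, two sets of class text embeddings.
p_hand, p_soft = dual_view_probabilities(
    [1.0, 0.2],
    [[1.0, 0.0], [0.0, 1.0]],   # hypothetical hand-crafted prompt embeddings
    [[0.9, 0.1], [0.1, 0.9]],   # hypothetical learnable prompt embeddings
)
```

Passing a mixed image feature instead of a single-class one gives the augmented views; the two returned distributions are the inputs an MI estimator would then compare.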