CPL: Counterfactual Prompt Learning for Vision and Language Models

Xuehai He, Diji Yang, Weixi Feng, Tsu-Jui Fu, Arjun Akula, Varun Jampani, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang

TL;DR

This work tackles the generalization gap in CLIP-style prompt tuning by introducing Counterfactual Prompt Learning (CPL), which constructs minimal non-spurious feature changes (counterfactuals) between semantically similar samples and optimizes prompts using both factual and counterfactual examples through contrastive learning. CPL employs a text-based negative sampling strategy (via BERTScore) to select challenging negatives, and builds counterfactual visual features with minimal perturbations to maximize discriminative signal, while freezing the vision and text encoders. The method also designs task-relevant prompts per downstream task (classification, image-text retrieval, VQA) and integrates a joint optimization objective combining cross-entropy with a contrastive loss, leading to improved performance on unseen classes across seven image datasets, and notable gains in image-text retrieval and VQA under few-shot settings. Overall, CPL demonstrates that counterfactual reasoning and contrastive learning can significantly enhance prompt representations for vision-language models, enabling more robust, data-efficient transfer to unseen concepts.
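
The joint objective described above can be illustrated with a minimal sketch. It assumes a cross-entropy term over class logits plus an InfoNCE-style contrastive term that pulls the prompt's text feature toward the factual image feature and away from the counterfactual one. The function names, the weight `lam`, the temperature `tau`, and the exact form of the contrastive term are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over a 1-D array.
    x = np.asarray(x, dtype=float)
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))

def cross_entropy(logits, label):
    # Standard cross-entropy for a single example with an integer label.
    return float(-log_softmax(logits)[label])

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(text_feat, v_factual, v_counterfactual, tau=0.07):
    # Hypothetical InfoNCE-style term: the factual feature is the positive,
    # the counterfactual feature is the single negative.
    sims = np.array([
        cosine(text_feat, v_factual),
        cosine(text_feat, v_counterfactual),
    ]) / tau
    return float(-log_softmax(sims)[0])

def joint_loss(logits, label, text_feat, v_factual, v_counterfactual,
               lam=1.0, tau=0.07):
    # Assumed combination: cross-entropy plus a weighted contrastive term.
    ce = cross_entropy(logits, label)
    cl = contrastive_loss(text_feat, v_factual, v_counterfactual, tau)
    return ce + lam * cl
```

With `lam=0` the sketch reduces to plain cross-entropy, which makes it easy to see what the contrastive term adds: it is near zero when the text feature aligns with the factual feature and grows as the counterfactual becomes equally similar.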

Abstract

Prompt tuning is a new few-shot transfer learning technique that only tunes the learnable prompt for pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models, which simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework. In particular, CPL constructs counterfactuals by identifying the minimal non-spurious feature change between semantically similar positive and negative samples that causes a concept change, and learns a more generalizable prompt representation from both factual and counterfactual examples via contrastive learning. Extensive experiments demonstrate that CPL obtains better few-shot performance on different vision and language tasks than previous prompt tuning methods on CLIP. On image classification, we achieve a 3.55% average relative improvement on unseen classes across seven datasets; on image-text retrieval and visual question answering, we gain up to 4.09% and 25.08% relative improvements, respectively, across three few-shot scenarios on unseen test sets.

Paper Structure

This paper contains 34 sections, 8 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: A conceptual overview of counterfactual prompt learning. CPL constructs counterfactuals by identifying the non-spurious feature change that causes the prompt change. In this case, the "barn" feature is the essential difference between Prompt A and Prompt B.
  • Figure 2: The counterfactual prompt learning framework. We freeze the vision encoder $F$ and the text encoder $G$, and only optimize the task-agnostic prompts and the instance-conditioned net $M$ (blue blocks). Please refer to the overview section for the explanation.
  • Figure 3: Counterfactual generation process. $\boldsymbol{v}$ and $c$ are the positive image feature and label, while $\boldsymbol{v}^-$ and $c^-$ are the negative image feature and label. $\circ$ is element-wise multiplication. By mixing $\boldsymbol{v}$ and $\boldsymbol{v}^-$, the counterfactual image feature $\boldsymbol{v'}$ is predicted as the negative label $c^-$ by the discriminator $D$. $\mathbf{u}$ is minimized so that only a minimal change to the positive image feature $\boldsymbol{v}$ is needed to causally change the label.
  • Figure 4: Visualization of the weights of the controller parameter $\mathbf{u}$ on images. The first column shows the original positive examples; the second column shows BERT-sampled negative examples; the third column shows randomly sampled negative examples for comparison. The BERTScore between the text prompts of the positive examples and the sampled examples is shown at the bottom.
  • Figure 5: Accuracy comparison on ImageNet unseen classes under three different shots. CPL consistently performs better than CoCoOp and has lower standard errors.
  • ...and 4 more figures
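
The mixing step in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming the counterfactual feature is an element-wise convex blend of the positive and negative features gated by $\mathbf{u}$ with entries in $[0, 1]$. The function name `mix_counterfactual` and the blend formula are assumptions made for this example; the discriminator $D$ and the optimization of $\mathbf{u}$ are not modeled.

```python
import numpy as np

def mix_counterfactual(v, v_neg, u):
    """Blend a positive feature v with a negative feature v_neg.

    Assumed form: v' = (1 - u) * v + u * v_neg, element-wise,
    where each entry of u lies in [0, 1]. Entries of u near 0 keep
    the positive feature; entries near 1 take the negative feature.
    """
    v = np.asarray(v, dtype=float)
    v_neg = np.asarray(v_neg, dtype=float)
    u = np.asarray(u, dtype=float)
    if not (v.shape == v_neg.shape == u.shape):
        raise ValueError("v, v_neg and u must have the same shape")
    if np.any(u < 0.0) or np.any(u > 1.0):
        raise ValueError("entries of u must lie in [0, 1]")
    return (1.0 - u) * v + u * v_neg

def change_size(u):
    # L1 norm of u: a simple proxy for how much of v was replaced.
    return float(np.sum(np.abs(np.asarray(u, dtype=float))))
```

A sparse `u` (small `change_size`) means only a few feature dimensions were swapped, which matches the caption's goal of capturing a minimal change that flips the label.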