CPL: Counterfactual Prompt Learning for Vision and Language Models
Xuehai He, Diji Yang, Weixi Feng, Tsu-Jui Fu, Arjun Akula, Varun Jampani, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang
TL;DR
This work addresses the generalization gap in CLIP-style prompt tuning by introducing Counterfactual Prompt Learning (CPL), which constructs counterfactuals as minimal non-spurious feature changes between semantically similar samples and optimizes prompts on both factual and counterfactual examples through contrastive learning. CPL uses a text-based negative sampling strategy (via BERTScore) to select challenging negatives and builds counterfactual visual features with minimal perturbations to maximize discriminative signal, while keeping the vision and text encoders frozen. The method also designs task-relevant prompts for each downstream task (classification, image-text retrieval, VQA) and combines cross-entropy with a contrastive loss in a joint optimization objective, leading to improved performance on unseen classes across seven image datasets and notable gains in image-text retrieval and VQA under few-shot settings. Overall, CPL shows that counterfactual reasoning and contrastive learning can enhance prompt representations for vision-language models, enabling more robust, data-efficient transfer to unseen concepts.
Abstract
Prompt tuning is a new few-shot transfer learning technique that tunes only the learnable prompt for pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models, which simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework. In particular, CPL constructs counterfactuals by identifying the minimal non-spurious feature change between semantically similar positive and negative samples that causes a concept change, and learns more generalizable prompt representations from both factual and counterfactual examples via contrastive learning. Extensive experiments demonstrate that CPL achieves better few-shot performance on different vision and language tasks than previous prompt tuning methods on CLIP. On image classification, we achieve a 3.55% average relative improvement on unseen classes across seven datasets; on image-text retrieval and visual question answering, we gain up to 4.09% and 25.08% relative improvements, respectively, across three few-shot scenarios on unseen test sets.
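As a rough illustration of the joint objective described above, the following sketch combines a CLIP-style cross-entropy term with a contrastive term that treats the factual image feature as the positive and a counterfactual feature as the negative. It is a minimal sketch under stated assumptions, not the authors' implementation: the element-wise blend used to build the counterfactual, the blend weights `u`, the temperature `tau`, the weight `lam`, and all function names are hypothetical choices made for illustration, and the toy vectors stand in for frozen encoder outputs.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale to unit length so dot products are cosine similarities.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def make_counterfactual(v_pos, v_neg, u):
    # Assumed form: element-wise blend of a positive and a negative feature.
    # u[i] = 0 keeps the positive feature, u[i] = 1 swaps in the negative one.
    return [(1.0 - ui) * p + ui * n for p, n, ui in zip(v_pos, v_neg, u)]

def cross_entropy(image_feat, class_text_feats, label, tau=0.07):
    # Softmax cross-entropy over cosine similarities to each class prompt.
    v = normalize(image_feat)
    logits = [dot(v, normalize(t)) / tau for t in class_text_feats]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def contrastive_loss(text_feat, v_factual, v_counterfactual, tau=0.07):
    # Two-way InfoNCE: factual feature is the positive,
    # counterfactual feature is the negative.
    t = normalize(text_feat)
    s_pos = dot(t, normalize(v_factual)) / tau
    s_neg = dot(t, normalize(v_counterfactual)) / tau
    m = max(s_pos, s_neg)
    return m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m)) - s_pos

def joint_loss(v_pos, v_neg, u, class_text_feats, label, lam=1.0):
    # Joint objective: cross-entropy plus a weighted contrastive term.
    v_cf = make_counterfactual(v_pos, v_neg, u)
    ce = cross_entropy(v_pos, class_text_feats, label)
    cl = contrastive_loss(class_text_feats[label], v_pos, v_cf)
    return ce + lam * cl

# Toy features standing in for frozen image and text encoder outputs.
v_pos = [1.0, 0.0, 0.2]
v_neg = [0.0, 1.0, 0.2]
u = [0.5, 0.5, 0.0]
class_text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss = joint_loss(v_pos, v_neg, u, class_text_feats, label=0)
```

In a real prompt-tuning setup the text features would depend on the learnable prompt and the loss would be minimized with respect to it by gradient descent; the sketch only shows how the two terms could be combined for a single example.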
