Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning
Qian-Wei Wang, Yaguang Song, Shu-Tao Xia
TL;DR
This work tackles the problem of adapting large vision–language models (CLIP) to downstream image classification under limited annotation budgets by integrating uncertainty modeling directly into the model. It introduces a dual-prompt framework in CLIP's textual branch, comprising a positive and a negative prompt, to estimate per-sample pseudo-label reliability via $p^{\text{clean}}_{\hat{y}}$. This estimate guides both uncertainty-aware sample selection and confident pseudo-label mining, while Visual Prompt Tuning (VPT) adapts the visual encoder. A round-based active-learning loop reinitializes the model each round, ranks unlabeled samples within each predicted class, and selects uncertain examples for labeling and confident ones for pseudo-labeling. The approach is evaluated across six datasets, three PEFT paradigms (CoOp, VPT, MaPLe), and two backbones (ViT-B/16 and ViT-L/14), where it consistently outperforms strong AL baselines, illustrating the value of model-integrated uncertainty signals for efficient CLIP adaptation in practical low-label regimes.
Abstract
Pre-trained vision-language models such as CLIP exhibit strong transferability, yet adapting them to downstream image classification tasks under limited annotation budgets remains challenging. In active learning settings, the model must select the most informative samples for annotation from a large pool of unlabeled data. Existing approaches typically estimate uncertainty via entropy-based criteria or representation clustering, without explicitly modeling uncertainty from the model's perspective. In this work, we propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We introduce two learnable prompts in the textual branch of CLIP. The positive prompt enhances the discriminability of task-specific textual embeddings corresponding to lightweight tuned visual embeddings, improving classification reliability. Meanwhile, the negative prompt is trained in a reversed manner to explicitly model the probability that the predicted label is correct, providing a principled uncertainty signal for guiding active sample selection. Extensive experiments across different fine-tuning paradigms demonstrate that our method consistently outperforms existing active learning methods under the same annotation budget.
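
To make the selection step concrete, the following is a minimal sketch of ranking unlabeled samples within each predicted class by a reliability score and splitting them into an annotation set and a pseudo-label set. It is an illustration under stated assumptions, not the authors' implementation: the function name `split_by_reliability`, the fixed per-class budgets `n_query_per_class` and `n_pseudo_per_class`, and the rule "lowest scores are queried, highest scores are pseudo-labeled" are all hypothetical choices. The scores `p_clean` stand in for $p^{\text{clean}}_{\hat{y}}$ and are assumed to be already computed by the model.

```python
import numpy as np


def split_by_reliability(pred_labels, p_clean, n_query_per_class, n_pseudo_per_class):
    """Rank samples inside each predicted class by a reliability score.

    pred_labels: predicted class index for every unlabeled sample.
    p_clean:     assumed per-sample probability that the prediction is correct.
    Returns (query_idx, pseudo_idx): sample indices to send for annotation
    (least reliable) and sample indices to pseudo-label (most reliable).
    The per-class budgets are an assumption of this sketch.
    """
    pred_labels = np.asarray(pred_labels)
    p_clean = np.asarray(p_clean, dtype=float)
    query_idx, pseudo_idx = [], []

    for c in np.unique(pred_labels):
        members = np.flatnonzero(pred_labels == c)
        # Sort this class's members by ascending reliability.
        order = members[np.argsort(p_clean[members], kind="stable")]

        # Least reliable samples go to the annotator.
        n_q = min(n_query_per_class, len(order))
        query, rest = order[:n_q], order[n_q:]

        # Most reliable of the remaining samples keep their predicted label.
        n_p = min(n_pseudo_per_class, len(rest))
        pseudo = rest[len(rest) - n_p:]

        query_idx.extend(query.tolist())
        pseudo_idx.extend(pseudo.tolist())

    return sorted(query_idx), sorted(pseudo_idx)
```

For example, with `pred_labels = [0, 0, 0, 1, 1, 1]`, `p_clean = [0.9, 0.2, 0.6, 0.1, 0.8, 0.5]`, and both budgets set to 1, the sketch queries samples 1 and 3 and pseudo-labels samples 0 and 4. Ranking within each predicted class, rather than globally, keeps both sets spread across classes.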
