Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning

Qian-Wei Wang; Yaguang Song; Shu-Tao Xia

TL;DR

This work tackles the problem of adapting large vision-language models (CLIP) to downstream image classification under limited annotation budgets by integrating uncertainty modeling directly into the model. It introduces a dual-prompt framework in CLIP's textual branch, comprising a positive and a negative prompt, to estimate per-sample pseudo-label reliability via $p^{\text{clean}}_{\hat{y}}$ and to guide both uncertainty-aware sample selection and confident pseudo-label mining, while using Visual Prompt Tuning (VPT) to adapt the visual encoder. A round-based active-learning loop reinitializes the model each round, ranks unlabeled samples within each predicted class, and selects uncertain examples for labeling and confident ones for pseudo-labeling. The method achieves robust performance gains across six datasets, three PEFT paradigms (CoOp, VPT, MaPLe), and two backbones (ViT-B/16 and ViT-L/14). It consistently outperforms strong AL baselines, illustrating the value of model-integrated uncertainty signals for efficient CLIP adaptation in practical low-label regimes.
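
The per-class ranking step described above can be illustrated with a minimal sketch. It assumes the reliability scores $p^{\text{clean}}_{\hat{y}}$ have already been computed and are passed in as an array. The function name `select_per_class` and the fixed per-class budgets `n_query_per_class` and `n_pseudo_per_class` are hypothetical choices made for illustration; the summary does not state the exact selection rule or budgets used in the paper.

```python
import numpy as np

def select_per_class(pred_labels, p_clean, n_query_per_class, n_pseudo_per_class):
    """Split unlabeled samples into annotation queries and pseudo-labeled samples.

    pred_labels: (N,) predicted class index for each unlabeled sample.
    p_clean:     (N,) estimated probability that each prediction is correct.
    The two per-class budgets are illustrative assumptions, not values from the paper.
    """
    pred_labels = np.asarray(pred_labels)
    p_clean = np.asarray(p_clean, dtype=float)
    query_idx, pseudo_idx = [], []
    for c in np.unique(pred_labels):
        members = np.flatnonzero(pred_labels == c)
        # Rank this class's samples from least to most reliable.
        order = members[np.argsort(p_clean[members], kind="stable")]
        # Least reliable samples are sent to the annotator.
        n_q = min(n_query_per_class, len(order))
        query_idx.extend(order[:n_q].tolist())
        # Most reliable of the remaining samples keep their predicted label.
        rest = order[n_q:]
        n_p = min(n_pseudo_per_class, len(rest))
        if n_p > 0:
            pseudo_idx.extend(rest[-n_p:].tolist())
    return query_idx, pseudo_idx

queries, pseudos = select_per_class(
    pred_labels=[0, 0, 0, 1, 1, 1],
    p_clean=[0.9, 0.2, 0.6, 0.4, 0.95, 0.7],
    n_query_per_class=1,
    n_pseudo_per_class=1,
)
print(queries, pseudos)
```

Ranking within each predicted class, rather than over the whole pool, keeps both the queried and the pseudo-labeled sets spread across classes. In this sketch a sample is never placed in both sets.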

Abstract

Pre-trained vision-language models such as CLIP exhibit strong transferability, yet adapting them to downstream image classification tasks under limited annotation budgets remains challenging. In active learning settings, the model must select the most informative samples for annotation from a large pool of unlabeled data. Existing approaches typically estimate uncertainty via entropy-based criteria or representation clustering, without explicitly modeling uncertainty from the model perspective. In this work, we propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We introduce two learnable prompts in the textual branch of CLIP. The positive prompt enhances the discriminability of task-specific textual embeddings corresponding to lightweight tuned visual embeddings, improving classification reliability. Meanwhile, the negative prompt is trained in a reversed manner to explicitly model the probability that the predicted label is correct, providing a principled uncertainty signal for guiding active sample selection. Extensive experiments across different fine-tuning paradigms demonstrate that our method consistently outperforms existing active learning methods under the same annotation budget.

Paper Structure (18 sections, 3 equations, 1 figure, 4 tables)

Introduction
Related Work
Pre-trained VLMs and Parameter-Efficient Fine-Tuning
Active Learning for VLMs
Methodology
Overview
Dual-Prompt-Based CLIP Adaptation Model
Textual Dual Prompt Learning
Parameter-Efficient Fine-Tuning of the Visual Encoder
Uncertainty-Driven AL Framework
Experiments
Experimental Setup
Datasets and Baselines
Implementation Details
Main Results
...and 3 more sections

Figures (1)

Figure 1: The overall illustration of our method.
