Doubly Debiased Test-Time Prompt Tuning for Vision-Language Models

Fei Song, Yi Li, Rui Wang, Jiahuan Zhou, Changwen Zheng, Jiangmeng Li

TL;DR

This work addresses the prompt optimization bias that arises when test-time prompt tuning relies solely on unlabeled test data. It introduces Doubly Debiased Test-Time Prompt Tuning (D2TPT), which combines a dynamic retrieval-augmented modulation (RAM) module with a reliability-aware prompt optimization (RPO) module to reduce bias from both model and data perspectives. RAM builds a dynamic knowledge base of high-confidence predictions to modulate outputs, while RPO applies a confidence-weighted ensemble and cross-modal consistency distillation to regularize prompt tuning. Across 15 datasets covering natural distribution shifts and cross-dataset generalization, D2TPT consistently outperforms strong baselines and improves cross-modal alignment, demonstrating robust, label-free tuning for vision-language models.

Abstract

Test-time prompt tuning for vision-language models has demonstrated impressive generalization capabilities under zero-shot settings. However, tuning the learnable prompts solely based on unlabeled test data may induce prompt optimization bias, ultimately leading to suboptimal performance on downstream tasks. In this work, we analyze the underlying causes of prompt optimization bias from both the model and data perspectives. On the model side, the entropy minimization objective focuses on reducing the entropy of model predictions while overlooking their correctness. This can result in overconfident yet incorrect outputs, thereby compromising the quality of prompt optimization. On the data side, prompts affected by optimization bias can introduce misalignment between visual and textual modalities, which further aggravates the prompt optimization bias. To this end, we propose a Doubly Debiased Test-Time Prompt Tuning method. Specifically, we first introduce a dynamic retrieval-augmented modulation module that retrieves high-confidence knowledge from a dynamic knowledge base using the test image feature as a query, and uses the retrieved knowledge to modulate the predictions. Guided by the refined predictions, we further develop a reliability-aware prompt optimization module that incorporates a confidence-based weighted ensemble and cross-modal consistency distillation to impose regularization constraints during prompt tuning. Extensive experiments across 15 benchmark datasets involving both natural distribution shifts and cross-dataset generalization demonstrate that our method outperforms baselines, validating its effectiveness in mitigating prompt optimization bias.

Paper Structure

This paper contains 17 sections, 10 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of the classical prompt tuning method CoOp (Zhou et al., 2022) and the test-time prompt tuning method TPT (Shu et al., 2022). CoOp uses a few labeled samples to optimize the learnable prompt via a supervised classification loss, while TPT performs label-free optimization by minimizing the entropy of predictions.
  • Figure 2: (Top) Examples illustrating that entropy minimization can lead to overconfident predictions. For instance, although TPT's prediction for petunia has lower entropy and higher confidence, the prediction is incorrect. (Bottom) Effect of the alignment strategy across different datasets. A, R, S, and V denote the ImageNet-A, ImageNet-R, ImageNet-Sketch, and ImageNet-V2 datasets, respectively.
  • Figure 3: The overall framework of our D2TPT method.
  • Figure 4: Case study on prediction confidence and correctness. We use pred. to denote the model’s predicted category.
  • Figure 5: Comparison of normalized cross-modal feature distances across different models.
  • ...and 1 more figure
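
The caption of Figure 2 notes that a prediction can have low entropy and still be wrong. The short sketch below illustrates that point numerically. The logits and the true class index are made-up values chosen for illustration, not results from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits for one test image whose true class is index 0.
true_class = 0
hesitant_but_correct = softmax([2.0, 1.5, 0.5])
confident_but_wrong = softmax([0.5, 6.0, 0.5])

for name, probs in [
    ("hesitant", hesitant_but_correct),
    ("confident", confident_but_wrong),
]:
    pred = probs.index(max(probs))
    print(f"{name}: entropy={entropy(probs):.3f} correct={pred == true_class}")
```

An objective that only minimizes entropy would prefer the second distribution, even though it selects the wrong class. This is the kind of failure the abstract attributes to entropy minimization overlooking correctness.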