C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion

Hee Suk Yoon; Eunseop Yoon; Joshua Tian Jin Tee; Mark Hasegawa-Johnson; Yingzhen Li; Chang D. Yoo

TL;DR

This work tackles calibration in test-time prompt tuning for CLIP by uncovering that the choice of textual prompts heavily influences calibration, independent of accuracy. It introduces Average Text Feature Dispersion (ATFD) and demonstrates a strong negative correlation between ATFD and Expected Calibration Error (ECE). Building on this, Calibrated Test-time Prompt Tuning (C-TPT) jointly optimizes prompts to maximize ATFD during test-time, yielding better-calibrated predictions without requiring labeled data. Across 11 fine-grained datasets and natural distribution shifts with multiple CLIP architectures, C-TPT consistently reduces calibration error while preserving or enhancing accuracy, outperforming temperature-scaling baselines and revealing a practical, generalizable calibration strategy for vision-language models.
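For readers unfamiliar with the metric, the following is a minimal sketch of the standard binned ECE estimator, assuming equal-width confidence bins. The function name, the `n_bins` default, and the toy inputs are illustrative assumptions and are not taken from the paper or its code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin.

    confidences: top-1 predicted probabilities in [0, 1]
    correct:     booleans (or 0/1), True where the prediction was right
    n_bins:      number of equal-width bins (assumed default, not from the paper)
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    conf_sum = [0.0] * n_bins   # summed confidence per bin
    hit_sum = [0.0] * n_bins    # number of correct predictions per bin
    count = [0] * n_bins        # number of samples per bin

    for c, ok in zip(confidences, correct):
        # Bin k covers [k/n_bins, (k+1)/n_bins); confidence 1.0 goes to the last bin.
        k = min(int(c * n_bins), n_bins - 1)
        conf_sum[k] += c
        hit_sum[k] += 1.0 if ok else 0.0
        count[k] += 1

    total = len(confidences)
    ece = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[k] / count[k]
        avg_acc = hit_sum[k] / count[k]
        ece += (count[k] / total) * abs(avg_acc - avg_conf)
    return ece


# Toy example: an overconfident model (90% confidence, 75% accuracy).
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

An overconfident model, as TPT is reported to produce, shows up here as bins whose average confidence exceeds their accuracy.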

Abstract

In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime example is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have mainly been developed to improve accuracy, overlooking calibration, which is crucial for quantifying prediction uncertainty. Moreover, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the prompt choice significantly affects the calibration of CLIP: prompts that lead to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts at test time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data. The code is publicly accessible at https://github.com/hee-suk-yoon/C-TPT.
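As a rough illustration of text feature dispersion, the sketch below assumes ATFD is the mean L2 distance between each class's text feature and the centroid of all class text features, and that the C-TPT term is the negative of ATFD added to the TPT loss with a weight. The function names, the weight `lam`, and the toy vectors are hypothetical; the authors' repository remains the authoritative reference.

```python
import math


def atfd(text_features):
    """Average Text Feature Dispersion (assumed form).

    text_features: list of N equal-length vectors, one per class,
                   e.g. the text-encoder outputs for "<prompt> {class}".
    Returns the mean Euclidean distance of the vectors from their centroid.
    """
    n = len(text_features)
    if n == 0:
        raise ValueError("need at least one text feature")
    dim = len(text_features[0])
    # Centroid: per-dimension mean over all class text features.
    centroid = [sum(t[j] for t in text_features) / n for j in range(dim)]
    return sum(math.dist(t, centroid) for t in text_features) / n


def joint_loss(tpt_loss, text_features, lam):
    """Hypothetical joint objective: TPT loss plus a weighted dispersion term.

    Minimizing -ATFD is the same as maximizing ATFD, so lowering this loss
    pushes the class text features apart. `lam` is an assumed weight.
    """
    return tpt_loss + lam * (-atfd(text_features))


# Toy 2-D "text features": spread-out classes vs. collapsed classes.
spread = atfd([[1.0, 0.0], [-1.0, 0.0]])      # 1.0
collapsed = atfd([[1.0, 0.0], [1.0, 0.0]])    # 0.0
```

In an actual prompt-tuning loop the same quantity would be computed on differentiable tensors so that gradients flow back to the prompt tokens; the pure-Python version here only shows the arithmetic.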

Paper Structure (44 sections, 5 equations, 9 figures, 16 tables)

Introduction
Related Work
Prompt Tuning for Vision-Language Models
Calibration of Neural Network
Background and Problem Setup
Zero-Shot Classification using Vision-Language Model (CLIP)
Prompt Tuning for CLIP
Calibration Error and Metric
Revisiting the Calibration of CLIP models
Observation 1: Test-Time Prompt Tuning (TPT) Increases the Calibration Error.
Observation 2: Prompt Sensitivity in Calibration of CLIP Models.
Observation 3: Well-calibrated Prompts have High Text-Feature Dispersion.
C-TPT: Calibrated Test-Time Prompt Tuning
Correlation Between Calibration and Text Feature Dispersion
C-TPT: Calibrated Test-Time Prompt Tuning
...and 29 more sections

Figures (9)

Figure 1: Observations. (The plots are based on CLIP-ViT-B/16 on the StanfordCars dataset.) (a) Observation 1 shows the reliability diagrams of the predictions made with the hard prompt template ('an example of {class}') (left) and after applying TPT (right). The diagrams highlight the negative impact of TPT on calibration due to overconfident predictions. (b) Observation 2 shows that the calibration error (i.e., ECE) varies across 80 different hard prompt templates despite similar accuracy. (c) Observation 3 features a t-SNE visualization of text feature clustering patterns of different prompts with similar accuracy but different ECE, suggesting that text feature dispersion has a strong relationship with the calibration error of CLIP.
Figure 2: Plot illustrating the correlation between ECE and ATFD for hard prompts that achieve accuracies within 3% of the highest accuracy observed for each dataset. A notable negative association is observed for CLIP-RN50 and CLIP-ViT-B/16 across different datasets, with Pearson correlation coefficients averaging -0.70 and -0.76, respectively.
Figure 3: Illustration of the Calibrated Test-time Prompt Tuning (C-TPT) for zero-shot image classification using CLIP. C-TPT improves calibration by optimizing the prompt so that it maximizes the Average Text Feature Dispersion (ATFD) during test-time prompt tuning.
Figure 4: Comparison of calibration error between TPT, temperature-scaled TPT ($\text{TPT}_{\text{temp}}$), and the joint use of our proposed C-TPT (TPT+C-TPT). Results are based on CLIP-ViT-B/16.
Figure 5: t-SNE visualization of class-embedded textual representations for (a) Hard Prompts and (b) Tuned Prompts, utilizing the CLIP-RN50 model on the Caltech101 dataset. In both cases, each unique color signifies a distinct prompt in the Prompt Visualization (left) and a distinct class in the Class Visualization (right). The legends belong to the Prompt Visualization (left) for both cases.
...and 4 more figures
