Towards Calibrating Prompt Tuning of Vision-Language Models

Ashshak Sharifdeen, Fahad Shamshad, Muhammad Akhtar Munir, Abhishek Basu, Mohamed Insaf Ismithdeen, Jeyapriyan Jeyamohan, Chathurika Sewwandi Silva, Karthik Nandakumar, Muhammad Haris Khan

TL;DR

This work proposes a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. The framework reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques on both base and novel classes.
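For readers unfamiliar with the metric, the following is a minimal sketch of the commonly used equal-width-bin ECE estimator: the sample-weighted average, over confidence bins, of the gap between accuracy and mean confidence. The function name, the choice of 10 bins, and the left-closed binning convention are assumptions made for illustration, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (|bin| / N) * |accuracy - mean confidence|.

    confidences: max softmax probability per sample, each in [0, 1].
    correct:     1 (or True) if the prediction was right, else 0 (or False).
    Binning convention (an assumption): bins are [lo, hi), and 1.0 falls in the last bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one sample")

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        hit_sum[b] += float(ok)
        count[b] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hit_sum[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


# Overconfident toy model: 90% confident, but only 3 of 4 predictions are right.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
# Underconfident toy model: 60% confident, yet every prediction is right.
underconfident = expected_calibration_error([0.6, 0.6, 0.6], [1, 1, 1])
```

Both toy cases yield a nonzero ECE, because the metric penalizes the absolute gap between accuracy and confidence regardless of its direction.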

Abstract

Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and unreliable predictive uncertainty. We address this problem by proposing a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. Our approach extends the standard cross-entropy loss with two complementary regularizers: (1) a mean-variance margin penalty that stabilizes inter-class logit margins by maximizing their average while minimizing dispersion, mitigating underconfidence and overconfidence spikes; and (2) a text moment-matching loss that aligns the first and second moments of tuned text embeddings with their frozen CLIP counterparts, preserving semantic dispersion crucial for generalization. Through extensive experiments across 7 prompt-tuning methods and 11 diverse datasets, we demonstrate that our approach significantly reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques on both base and novel classes.
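The abstract does not give the exact form of the two regularizers, so the following is only one plausible reading, sketched in plain Python. It assumes the margin is the top-1 margin (true-class logit minus the largest other logit), that "dispersion" means the batch variance of those margins, and that the first and second moments of the text embeddings are the per-dimension mean and variance across classes. All function names and the weights `var_weight`, `lam_margin`, and `lam_text` are hypothetical.

```python
def top1_margins(logits, labels):
    """Per-sample margin: true-class logit minus the largest competing logit."""
    return [
        row[y] - max(z for j, z in enumerate(row) if j != y)
        for row, y in zip(logits, labels)
    ]


def mean_variance_margin_penalty(logits, labels, var_weight=1.0):
    """Lower is better: rewards a large average margin, penalizes its variance."""
    m = top1_margins(logits, labels)
    mean = sum(m) / len(m)
    var = sum((x - mean) ** 2 for x in m) / len(m)
    return -mean + var_weight * var


def _moments(embeddings):
    """Per-dimension mean and (population) variance across class embeddings."""
    n, d = len(embeddings), len(embeddings[0])
    mean = [sum(e[k] for e in embeddings) / n for k in range(d)]
    var = [sum((e[k] - mean[k]) ** 2 for e in embeddings) / n for k in range(d)]
    return mean, var


def text_moment_matching_loss(tuned, frozen):
    """Squared distance between the moments of tuned and frozen text embeddings."""
    mean_t, var_t = _moments(tuned)
    mean_f, var_f = _moments(frozen)
    first = sum((a - b) ** 2 for a, b in zip(mean_t, mean_f))
    second = sum((a - b) ** 2 for a, b in zip(var_t, var_f))
    return first + second


def total_loss(ce, logits, labels, tuned, frozen, lam_margin=0.1, lam_text=0.1):
    """Cross-entropy (computed elsewhere) plus the two weighted regularizers."""
    return (
        ce
        + lam_margin * mean_variance_margin_penalty(logits, labels)
        + lam_text * text_moment_matching_loss(tuned, frozen)
    )


# Toy inputs: two samples over three classes, two classes with 2-d text embeddings.
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 1.0]]
labels = [0, 1]
frozen = [[2.0, 1.0], [1.0, 2.0]]
tuned = [[1.0, 0.0], [0.0, 1.0]]  # same spread as frozen, shifted mean

margin_term = mean_variance_margin_penalty(logits, labels)
text_term = text_moment_matching_loss(tuned, frozen)
```

In this toy case both margins are equal, so the variance part of the margin term vanishes, and the text term is driven entirely by the shift in the mean because the per-dimension variances already match. A real implementation would compute these terms on differentiable tensors and may match the second moment differently (for example with a full covariance) than this sketch does.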

Paper Structure (21 sections, 4 equations, 9 figures, 16 tables)

Figures (9)

  • Figure 1: Expected Calibration Error (ECE) on 11 datasets with CoOp, shown as radar plots. Left: On base classes, our method (red) consistently yields lower ECE than competing approaches, with notable gains on DTD, EuroSAT, and Food. Right: On novel classes, our method reduces miscalibration relative to vanilla CoOp (yellow) and outperforms DAC and ZS-Norm, especially on Aircraft and Cars. The uniformly smaller footprint of our curve indicates better calibration, supporting the effectiveness of the proposed dual-regularization approach in addressing both underconfidence on base classes and overconfidence on novel classes.
  • Figure 2: Dual miscalibration in prompt-tuned CLIP. Left (top row): Base classes are underconfident (accuracy exceeds confidence), which our regularization terms improve. Left (bottom row): Novel classes exhibit overconfidence (confidence exceeds accuracy), which our method effectively mitigates. Right: Inter-class margin variability vs. ECE shows a negative correlation for base classes and a positive correlation for novel classes, indicating that prompt tuning tightens margins on base classes and inflates them on novel classes, degrading reliability. These trends motivate our margin-stabilizing and moment-preserving regularizers.
  • Figure 3: Errors by confidence on novel classes. Higher error mass in high-confidence bins indicates overconfidence. Both Cross-Entropy and Cross-Entropy + Margin place more misclassified samples in high-confidence regions, whereas adding Text Moment-Matching to the Margin term shifts errors away from these bins, reducing overconfidence.
  • Figure 4: Performance with different numbers of shots and hard prompt styles.
  • Figure 5: Margin-based Label Smoothing (MBLS) vs. Mean-Variance Margin (Margin) regularization. In the base-class reliability diagram (Observation 1), MBLS with Cross-Entropy (CE) shows underconfidence, while adding the Margin term alleviates this. For Observations 3 and 4, we train CE, CE + MBLS, and CE + Margin with MaPLe, compute the top-1 margin $m=z_y-\max_{j\neq y}z_j$, and plot the Empirical Cumulative Distribution Function (ECDF). The ECDF shows that Margin yields fewer low-margin (underconfident) samples, MBLS trims large margins but leaves small ones, and CE lies in between. Box plots confirm this: Margin has the highest median ($\approx 5$) with a right-shifted IQR, MBLS caps extremes with a tight IQR, while CE retains a broad spread.
  • ...and 4 more figures
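
The margin analysis described in the Figure 5 caption can be sketched as follows: compute the top-1 margin $m=z_y-\max_{j\neq y}z_j$ per sample, then summarize its distribution with an ECDF value and the median. The logits below are made-up toy values, and the helper names `top1_margin` and `ecdf_at` are hypothetical; this is not the authors' evaluation code.

```python
from statistics import median


def top1_margin(logits, label):
    """m = z_y - max_{j != y} z_j for a single sample."""
    rival = max(z for j, z in enumerate(logits) if j != label)
    return logits[label] - rival


def ecdf_at(values, threshold):
    """Empirical CDF evaluated at threshold: fraction of values <= threshold."""
    return sum(v <= threshold for v in values) / len(values)


# Toy logits for four samples over three classes (illustrative values only).
logits = [
    [4.0, 1.0, 0.5],
    [0.2, 0.1, 0.0],
    [1.0, 6.0, 2.0],
    [0.0, 1.0, 3.0],
]
labels = [0, 0, 1, 1]

margins = [top1_margin(z, y) for z, y in zip(logits, labels)]
low_margin_fraction = ecdf_at(margins, 0.5)  # share of samples with margin <= 0.5
median_margin = median(margins)
```

A negative margin, as for the last toy sample, means the true class is not the top-scoring class. A method whose ECDF rises more slowly at small thresholds has fewer low-margin samples, which is the pattern the caption attributes to the Margin regularizer.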