Open-Vocabulary Calibration for Fine-tuned CLIP

Shuoyuan Wang, Jindong Wang, Guoqing Wang, Bob Zhang, Kaiyang Zhou, Hongxin Wei

TL;DR

This work investigates the reliability of fine-tuned vision-language models in open-vocabulary settings and uncovers a persistent miscalibration: predictions on base classes tend to be underconfident, while predictions on novel classes tend to be overconfident. To address this, it proposes Distance-Aware Calibration (DAC), a post-hoc method that scales prediction confidence using a textual deviation (TD) score computed from base and novel class text embeddings, effectively adapting the temperature at no extra inference cost. DAC is designed to be plug-and-play with existing prompt-tuning methods and consistently reduces expected calibration error (ECE) across 11 datasets and 7 tuning methods, with substantial improvements on high-confidence predictions. The approach is also robust to the number of neighbors used in the textual proximity computation and extends to full fine-tuning, suggesting practical potential for reliable open-vocabulary recognition in real-world deployments.

Abstract

Vision-language models (VLMs) have emerged as formidable tools, showing strong capability in handling various open-vocabulary tasks in image recognition, text-driven visual content generation, and visual chatbots, to name a few. In recent years, considerable effort and resources have been devoted to adaptation methods for improving the downstream performance of VLMs, particularly parameter-efficient fine-tuning methods like prompt learning. However, a crucial aspect that has been largely overlooked is the confidence calibration problem in fine-tuned VLMs, which could greatly reduce reliability when deploying such models in the real world. This paper bridges the gap by systematically investigating the confidence calibration problem in the context of prompt learning and reveals that existing calibration methods are insufficient to address the problem, especially in the open-vocabulary setting. To solve the problem, we present a simple and effective approach called Distance-Aware Calibration (DAC), which scales the temperature using the distance between predicted text labels and base classes as guidance. Experiments with 7 distinct prompt learning methods applied across 11 diverse downstream datasets demonstrate the effectiveness of DAC, which achieves high efficacy without sacrificing inference speed. Our code is available at https://github.com/ml-stat-Sustech/CLIP_Calibration.
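
As a rough illustration of the idea in the abstract, the sketch below picks a per-sample temperature from how far the predicted class's text embedding lies from the base-class text embeddings. The function names (`textual_distance`, `distance_aware_softmax`), the mean cosine distance over the k nearest base classes, the parameter `alpha`, and the linear distance-to-temperature mapping are all illustrative assumptions, not the paper's exact formulation; the official repository is the reference for that.

```python
import numpy as np

def textual_distance(class_emb, base_emb, k=2):
    """Mean cosine distance from each class embedding to its k nearest base-class embeddings."""
    a = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    b = base_emb / np.linalg.norm(base_emb, axis=1, keepdims=True)
    dist = 1.0 - a @ b.T                      # shape: (num_classes, num_base)
    k = min(k, dist.shape[1])                 # cannot use more neighbors than base classes
    nearest = np.sort(dist, axis=1)[:, :k]    # k smallest distances per class
    return nearest.mean(axis=1)

def distance_aware_softmax(logits, class_dist, alpha=1.0):
    """Softmax whose temperature grows with the predicted class's distance (hypothetical mapping)."""
    pred = logits.argmax(axis=1)
    # Assumption: classes farther from the base classes get a larger temperature,
    # which softens (lowers) their confidence. Dividing by a positive temperature
    # leaves the predicted class unchanged.
    temperature = 1.0 + alpha * class_dist[pred]
    scaled = logits / temperature[:, None]
    scaled = scaled - scaled.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum(axis=1, keepdims=True)
```

Because the class distances depend only on text embeddings, they can be computed once before inference, which is consistent with the claim that calibration does not slow down prediction.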

Paper Structure (44 sections, 9 equations, 10 figures, 18 tables)

This paper contains 44 sections, 9 equations, 10 figures, 18 tables.

Figures (10)

  • Figure 1: Reliability of fine-tuned CLIP (ViT-B/16) on the Flower102 dataset. ECE: Expected Calibration Error (lower is better). Miscalibration is depicted in pink for overconfidence and purple for underconfidence.
  • Figure 2: Paired image ($x$) / text ($w$) inputs are sampled from the DTD dataset, fed into zero-shot / tuned CLIP, and visualized in 2D using SVD. Compared with zero-shot CLIP, CoOp has a larger textual distribution gap between the base and new classes.
  • Figure 3: Class-wise performance on the StanfordCars dataset after tuning. $\text{ECE}^{*}$ with a positive (negative) value denotes overconfidence (underconfidence). The scatter points represent the original results and the broken line denotes the bin-based results. Confidence and $\text{ECE}^{*}$ increase as proximity decreases. Temperature scaling (TS) cannot mitigate the overconfidence.
  • Figure 4: $\text{ECE}^{*}$ (%) performance with different calibration methods. $\text{ECE}^{*}$ with a positive (negative) value denotes overconfidence (underconfidence). Our proposed DAC largely mitigates the overconfidence in the predicted classes with lower TD scores.
  • Figure 5: Hyperparameter sensitivity to the number of neighbors $K$ used in computing textual proximity. The miscalibration is noticeably mitigated when $K>1$.
  • ...and 5 more figures
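
The captions above report ECE and a signed variant, $\text{ECE}^{*}$. A minimal sketch of how such quantities are commonly computed follows: predictions are grouped into equal-width confidence bins, and the gap between mean confidence and accuracy is averaged with bin-size weights. The function name `calibration_errors`, the choice of 10 bins, and the reading of $\text{ECE}^{*}$ as the weighted signed gap (confidence minus accuracy) are assumptions made for illustration; the paper's class-wise definition may differ in detail.

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, signed gap) over equal-width confidence bins.

    confidences: max softmax probability per sample, in [0, 1]
    correct:     1 if the prediction was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b holds confidences in (edges[b], edges[b + 1]]; 0.0 falls in bin 0.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece, signed = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = confidences[mask].mean() - correct[mask].mean()
        weight = mask.mean()          # fraction of samples in this bin
        ece += weight * abs(gap)      # unsigned: size of miscalibration
        signed += weight * gap        # positive: overconfident, negative: underconfident
    return ece, signed
```

The signed value can be near zero even when ECE is large, because overconfident and underconfident bins cancel; this is why the two are best read together.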

Theorems & Definitions (1)

  • Definition 4.1: Proximity (Xiong et al., 2023)
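
The body of Definition 4.1 is not reproduced in this listing. For orientation only, a K-nearest-neighbor proximity is often written in the following form, where a sample that lies close to its neighbors receives a value near 1. The symbols $w$ (a class text embedding), $\mathcal{N}_K(w)$ (its $K$ nearest neighbors), and $\operatorname{dist}$ (a distance such as cosine distance) are assumptions here and may not match the paper's notation.

```latex
% Hedged sketch of a K-nearest-neighbor proximity; not copied from the paper.
P(w) = \exp\!\left( -\frac{1}{K} \sum_{w_i \in \mathcal{N}_K(w)} \operatorname{dist}(w, w_i) \right)
```

Under this form, a larger average distance to the neighbors gives a lower proximity, matching the direction described in the Figure 3 caption, where confidence and $\text{ECE}^{*}$ increase as proximity decreases.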