An Empirical Study Into What Matters for Calibrating Vision-Language Models

Weijie Tu, Weijian Deng, Dylan Campbell, Stephen Gould, Tom Gedeon

TL;DR

This work systematically studies calibration of Vision-Language Models (VLMs) across architectures, datasets, and training strategies. It shows that simple post-hoc temperature scaling dramatically improves uncertainty estimates for VLMs under distribution shifts and label-set changes, often more than for non-VLM baselines, and that calibration can be achieved with surprisingly few examples. The authors demonstrate cross-label-set and cross-hierarchy calibration, data-efficient calibration (roughly 40–50 images sufficing), and the transferability of calibration across prompts. They also introduce Calibration-by-Synthesis, using synthetic, fine-grained labeled data to calibrate when labeled data is unavailable, broadening practical applicability. Overall, the findings suggest that calibrated VLMs can be reliably deployed in risk-sensitive settings with modest calibration cost and simple strategies.

Abstract

Vision-Language Models (VLMs) have emerged as the dominant approach for zero-shot recognition, adept at handling diverse scenarios and significant distribution changes. However, their deployment in risk-sensitive areas requires a deeper understanding of their uncertainty estimation capabilities, a relatively uncharted area. In this study, we explore the calibration properties of VLMs across different architectures, datasets, and training strategies. In particular, we analyze the uncertainty estimation performance of VLMs when calibrated in one domain, label set or hierarchy level, and tested in a different one. Our findings reveal that while VLMs are not inherently calibrated for uncertainty, temperature scaling significantly and consistently improves calibration, even across shifts in distribution and changes in label set. Moreover, VLMs can be calibrated with a very small set of examples. Through detailed experimentation, we highlight the potential applications and importance of our insights, aiming for more reliable and effective use of VLMs in critical, real-world scenarios.
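Temperature scaling, the post-hoc method the abstract refers to, divides a model's logits by a single scalar T that is fitted on a small labeled calibration set. The following is a minimal sketch of that idea, not the authors' implementation. The function names (`softmax`, `mean_nll`, `fit_temperature`), the golden-section search used to fit T, and the synthetic data are all assumptions made for illustration; this excerpt does not state how the paper fits the temperature.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        total -= math.log(softmax(logits, temperature)[label] + 1e-12)
    return total / len(labels)

def fit_temperature(logits_batch, labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search over log T for the temperature that minimises NLL.

    The search method and bounds are illustrative assumptions.
    """
    ratio = (math.sqrt(5) - 1) / 2
    a, b = math.log(lo), math.log(hi)
    c, d = b - ratio * (b - a), a + ratio * (b - a)
    fc = mean_nll(logits_batch, labels, math.exp(c))
    fd = mean_nll(logits_batch, labels, math.exp(d))
    for _ in range(iters):
        if fc < fd:
            # Minimum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - ratio * (b - a)
            fc = mean_nll(logits_batch, labels, math.exp(c))
        else:
            # Minimum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + ratio * (b - a)
            fd = mean_nll(logits_batch, labels, math.exp(d))
    return math.exp((a + b) / 2)

# Synthetic calibration set (hypothetical data): labels are sampled from
# softmax(true_logits), so true_logits are calibrated by construction.
rng = random.Random(0)
num_classes, num_samples = 5, 400
true_logits = [[rng.gauss(0.0, 1.5) for _ in range(num_classes)]
               for _ in range(num_samples)]
labels = [rng.choices(range(num_classes), weights=softmax(z))[0]
          for z in true_logits]

# Multiplying the logits by 3 makes the model overconfident.
overconfident = [[3.0 * v for v in z] for z in true_logits]

fitted_T = fit_temperature(overconfident, labels)
nll_before = mean_nll(overconfident, labels, 1.0)
nll_after = mean_nll(overconfident, labels, fitted_T)
print(f"fitted T = {fitted_T:.2f}, NLL {nll_before:.3f} -> {nll_after:.3f}")
```

Because the synthetic logits were inflated by a factor of 3, the fitted temperature should land near 3. Dividing by T leaves the predicted class unchanged, so accuracy is unaffected and only the confidence values move.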

Paper Structure (35 sections, 12 figures, 2 tables)

This paper contains 35 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Comparing the calibration performance of ImageNet-trained models and VLMs. We report the results on the in-distribution test set (ID-Test) and two out-of-distribution (OOD) test sets: ImageNet-R and ImageNet-S. We plot the expected calibration error (ECE) before and after temperature scaling for each model. The blue dots represent VLMs and the green crosses denote ImageNet-trained models. We observe that VLMs are well-calibrated by temperature scaling on both ID and OOD test sets.
  • Figure 2: Adaptability of VLMs to different calibration label sets. Left: Calibration error reduction. Here, we observe a significant decrease in the expected calibration error for VLMs following cross-label-set calibration, as opposed to when no calibration is applied. Right: Correlation between VLM prediction probability and classification accuracy. This graph illustrates the classification accuracy of VLMs on ImageNet-R and ImageNet-S against their average prediction probability, before and after calibration with CIFAR-10-Val or DomainNet-Real. Each point represents a model, with the dashed black line indicating perfect calibration (y=x). The data show a strong linear and rank correlation, even when models are calibrated on label sets different from the target, supporting the effectiveness of cross-label-set calibration for VLMs.
  • Figure 3: Robustness of VLM calibration to label hierarchy levels. This figure presents box plots summarizing the calibration errors (ECEs) of VLMs calibrated with label hierarchies differing in granularity from the target dataset (ImageNet-S). The top row shows calibration at a coarser level, and the bottom row at a finer level. Although cross-level calibration does not match the precision of same-level calibration, the differences are minimal, indicating the robustness of VLM calibration to label granularity.
  • Figure 4: Data-efficiency of VLM calibration across diverse datasets. This figure displays the ECE of VLMs as a function of the calibration set size across four datasets: ImageNet-V2-A, ImageNet-S, CINIC, and DomainNet. The green stars are the average ECE of calibrated models trained on the dataset. The blue solid and green dashed horizontal lines represent the average ECE before calibration of VLMs and non-VLMs, respectively. The ECE values, averaged over ten random seeds (the calibration images are drawn at random ten times), plateau once merely $40$–$50$ images are included in the calibration set, and they closely approximate the error obtained using the full set. This trend is observed despite the high number of classes in DomainNet and ImageNet, where many classes may not be represented in the calibration set at all. These results highlight the data-efficiency of VLM calibration.
  • Figure 5: Impact of the distance between calibration and target set distributions on VLM uncertainty estimates. The distance between the calibration dataset and target dataset is computed by the Fréchet inception distance (FID; Heusel et al., 2017). A green star indicates that the dataset for this column has the same label set as the target dataset. We find that the calibration error has a weak correlation with the FID between the calibration and target datasets; however, label-set compatibility also plays a significant role.
  • ...and 7 more figures
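
The captions above report expected calibration error (ECE) before and after calibration. As a rough illustration of what that metric measures, the sketch below groups predictions into equal-width confidence bins and takes the weighted average gap between accuracy and mean confidence in each bin. The function name `expected_calibration_error`, the default of 15 bins, and the toy inputs are assumptions for illustration; this excerpt does not state the binning scheme used in the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width-bin ECE.

    confidences: top-class predicted probabilities in [0, 1].
    correct: 1 (or True) where the prediction was right, else 0 (or False).
    Returns sum over bins of (bin share) * |bin accuracy - bin mean confidence|.
    """
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # Bin k covers [k / n_bins, (k + 1) / n_bins); conf == 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hit_sums[idx] += int(hit)

    total = len(confidences)
    ece = 0.0
    for count, conf_sum, hits in zip(counts, conf_sums, hit_sums):
        if count:  # empty bins contribute nothing
            ece += (count / total) * abs(hits / count - conf_sum / count)
    return ece

# Toy example (hypothetical numbers): four predictions at 0.9 confidence with
# 3 of 4 correct, and two predictions at 0.6 confidence with 1 of 2 correct.
toy_conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
toy_correct = [1, 1, 1, 0, 1, 0]
toy_ece = expected_calibration_error(toy_conf, toy_correct, n_bins=10)
print(f"toy ECE = {toy_ece:.4f}")  # (4/6)*|0.75-0.9| + (2/6)*|0.5-0.6| = 0.1333
```

An ECE of 0 means confidence matches accuracy in every occupied bin; for instance, predictions that are all made at 0.5 confidence and are right exactly half the time score 0.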