An Empirical Study Into What Matters for Calibrating Vision-Language Models
Weijie Tu, Weijian Deng, Dylan Campbell, Stephen Gould, Tom Gedeon
TL;DR
This work systematically studies the calibration of Vision-Language Models (VLMs) across architectures, datasets, and training strategies. It shows that simple post-hoc temperature scaling substantially and consistently improves uncertainty estimates for VLMs under distribution shifts and label-set changes, often more than for non-VLM baselines, and that calibration requires surprisingly few examples (roughly 40–50 images suffice). The authors demonstrate that calibration transfers across label sets, hierarchy levels, and prompts. They also introduce Calibration-by-Synthesis, which uses synthetic, fine-grained labeled data to calibrate when real labeled data is unavailable. Overall, the findings suggest that VLMs can be deployed more reliably in risk-sensitive settings with modest calibration cost and simple strategies.
Abstract
Vision-Language Models (VLMs) have emerged as the dominant approach for zero-shot recognition, adept at handling diverse scenarios and significant distribution changes. However, their deployment in risk-sensitive areas requires a deeper understanding of their uncertainty estimation capabilities, a relatively uncharted area. In this study, we explore the calibration properties of VLMs across different architectures, datasets, and training strategies. In particular, we analyze the uncertainty estimation performance of VLMs when calibrated in one domain, label set or hierarchy level, and tested in a different one. Our findings reveal that while VLMs are not inherently calibrated for uncertainty, temperature scaling significantly and consistently improves calibration, even across shifts in distribution and changes in label set. Moreover, VLMs can be calibrated with a very small set of examples. Through detailed experimentation, we highlight the potential applications and importance of our insights, aiming for more reliable and effective use of VLMs in critical, real-world scenarios.
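To make the central technique concrete, the following is a minimal sketch of post-hoc temperature scaling, assuming access to a model's class logits (for a VLM, typically scaled image-text similarities) and ground-truth labels on a small calibration set. It is not the authors' implementation: the function names (`fit_temperature`, `expected_calibration_error`), the grid search over the temperature (in place of the gradient-based optimizer commonly used), and the 15-bin Expected Calibration Error are all illustrative assumptions.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a scalar temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    probs = softmax(logits, temperature)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(true_class_probs + 1e-12))


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature that minimizes NLL on the calibration set.

    Assumption: a log-spaced grid search stands in for the usual
    gradient-based fit; only a single scalar is learned either way.
    """
    if grid is None:
        grid = np.logspace(-2, 2, 400)
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin-weighted average gap between confidence and accuracy."""
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)


if __name__ == "__main__":
    # Synthetic stand-in for model outputs (not real VLM data):
    # labels are drawn from softmax(z), then logits are inflated 5x
    # to mimic an overconfident model.
    rng = np.random.default_rng(0)
    z = rng.normal(size=(4000, 10)) * 2.0
    p = softmax(z)
    labels = (p.cumsum(axis=1) > rng.random((4000, 1))).argmax(axis=1)
    logits = 5.0 * z

    t = fit_temperature(logits, labels)
    print("fitted temperature:", t)
    print("ECE before:", expected_calibration_error(softmax(logits), labels))
    print("ECE after: ", expected_calibration_error(softmax(logits, t), labels))
```

Because dividing logits by a positive scalar does not change the predicted class, this procedure leaves accuracy untouched and adjusts only the confidence values. Fitting a single parameter is also consistent with the observation above that a very small calibration set can suffice.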
