VLM-UQBench: A Benchmark for Modality-Specific and Cross-Modality Uncertainties in Vision Language Models
Chenyu Wang, Tianle Chen, H. M. Sabbir Ahmad, Kayhan Batmanghelich, Wenchao Li
TL;DR
VLM-UQBench targets the gap in uncertainty quantification (UQ) for vision-language models (VLMs) by evaluating modality-specific and cross-modal uncertainty together. It combines human-annotated VizWiz-based subsets, grounded cross-modal ambiguity from VQ-FocusAmbiguity, and a CLEVR-hallucination dataset with a scalable synthetic perturbation pipeline. The benchmark introduces two metrics, Uncertainty Reflection Rate (URR) and Hallucination Consistency Coefficient (HCC), and evaluates nine UQ methods across four VLMs, revealing strong modality specialization and model dependence, as well as a weak link between UQ signals and hallucination risk. The framework provides a scalable, modular platform for developing and validating more robust, modality-aware UQ methods for reliable VLM deployment.
Abstract
Uncertainty quantification (UQ) is vital for ensuring that vision-language models (VLMs) behave safely and reliably. A central challenge is to localize uncertainty to its source, determining whether it arises from the image, the text, or misalignment between the two. We introduce VLM-UQBench, a benchmark for modality-specific and cross-modal data uncertainty in VLMs. It consists of 600 real-world samples drawn from the VizWiz dataset, curated into clean, image-, text-, and cross-modal uncertainty subsets, and a scalable perturbation pipeline with 8 visual, 5 textual, and 3 cross-modal perturbations. We further propose two simple metrics that quantify the sensitivity of UQ scores to these perturbations and their correlation with hallucinations, and use them to evaluate a range of UQ methods across four VLMs and three datasets. Empirically, we find that: (i) existing UQ methods exhibit strong modality-specific specialization and substantial dependence on the underlying VLM, (ii) modality-specific uncertainty frequently co-occurs with hallucinations while current UQ scores provide only weak and inconsistent risk signals, and (iii) although UQ methods can rival reasoning-based chain-of-thought baselines on overt, group-level ambiguity, they largely fail to detect the subtle, instance-level ambiguity introduced by our perturbation pipeline. These results highlight a significant gap between current UQ practices and the fine-grained, modality-aware uncertainty required for reliable VLM deployment.
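
The abstract describes the two metrics only at a high level: one measures how sensitive UQ scores are to perturbations, the other how well they correlate with hallucinations. The sketch below illustrates one plausible reading of each idea and is not the paper's definition of either metric. The function names, the "score rises after perturbation" criterion, the `margin` parameter, and the use of Pearson correlation against binary labels are all assumptions made for illustration.

```python
from math import sqrt


def reflection_rate_sketch(clean_scores, perturbed_scores, margin=0.0):
    """Fraction of paired samples whose uncertainty score rises by more than
    `margin` after perturbation. (Assumed criterion, for illustration only.)"""
    if len(clean_scores) != len(perturbed_scores) or not clean_scores:
        raise ValueError("need two non-empty score lists of equal length")
    rises = sum(
        1 for clean, pert in zip(clean_scores, perturbed_scores)
        if pert - clean > margin
    )
    return rises / len(clean_scores)


def hallucination_correlation_sketch(scores, hallucinated):
    """Pearson correlation between uncertainty scores and binary
    hallucination labels. (Assumed choice of correlation measure.)"""
    n = len(scores)
    if n != len(hallucinated) or n < 2:
        raise ValueError("need at least two paired observations")
    labels = [1.0 if h else 0.0 for h in hallucinated]
    mean_s = sum(scores) / n
    mean_l = sum(labels) / n
    cov = sum((s - mean_s) * (l - mean_l) for s, l in zip(scores, labels))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_l = sum((l - mean_l) ** 2 for l in labels)
    if var_s == 0 or var_l == 0:
        # Correlation is undefined for a constant series; report no signal.
        return 0.0
    return cov / sqrt(var_s * var_l)


if __name__ == "__main__":
    # Toy numbers, not benchmark results.
    clean = [0.1, 0.2, 0.3, 0.4]
    perturbed = [0.5, 0.1, 0.6, 0.4]
    print(reflection_rate_sketch(clean, perturbed))
    print(hallucination_correlation_sketch(
        [0.1, 0.2, 0.8, 0.9], [False, False, True, True]))
```

Under this reading, a UQ method that responds to a perturbation type would score close to 1 on the first quantity for that type, and a method whose scores track hallucination risk would show a clearly positive correlation on the second.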
