VLM-UQBench: A Benchmark for Modality-Specific and Cross-Modality Uncertainties in Vision Language Models

Chenyu Wang, Tianle Chen, H. M. Sabbir Ahmad, Kayhan Batmanghelich, Wenchao Li

TL;DR

VLM-UQBench addresses a gap in uncertainty quantification (UQ) for vision-language models (VLMs) by evaluating modality-specific and cross-modal uncertainty within a single benchmark. It combines human-annotated VizWiz-based subsets, grounded cross-modal ambiguity from VQ-FocusAmbiguity, and a CLEVR-based hallucination dataset with a scalable synthetic perturbation pipeline. The benchmark introduces two metrics, Uncertainty Reflection Rate (URR) and Hallucination Consistency Coefficient (HCC), and evaluates nine UQ methods across four VLMs. The results show strong modality specialization and model dependence, and only a weak link between UQ signals and hallucination risk. The framework offers a scalable, modular platform for developing and validating more robust, modality-aware UQ methods for reliable VLM deployment.

Abstract

Uncertainty quantification (UQ) is vital for ensuring that vision-language models (VLMs) behave safely and reliably. A central challenge is to localize uncertainty to its source, determining whether it arises from the image, the text, or misalignment between the two. We introduce VLM-UQBench, a benchmark for modality-specific and cross-modal data uncertainty in VLMs. It consists of 600 real-world samples drawn from the VizWiz dataset, curated into clean, image-, text-, and cross-modal uncertainty subsets, and a scalable perturbation pipeline with 8 visual, 5 textual, and 3 cross-modal perturbations. We further propose two simple metrics that quantify the sensitivity of UQ scores to these perturbations and their correlation with hallucinations, and use them to evaluate a range of UQ methods across four VLMs and three datasets. Empirically, we find that: (i) existing UQ methods exhibit strong modality-specific specialization and substantial dependence on the underlying VLM, (ii) modality-specific uncertainty frequently co-occurs with hallucinations while current UQ scores provide only weak and inconsistent risk signals, and (iii) although UQ methods can rival reasoning-based chain-of-thought baselines on overt, group-level ambiguity, they largely fail to detect the subtle, instance-level ambiguity introduced by our perturbation pipeline. These results highlight a significant gap between current UQ practices and the fine-grained, modality-aware uncertainty required for reliable VLM deployment.

Paper Structure (52 sections, 18 equations, 25 figures, 5 tables, 1 algorithm)

Figures (25)

  • Figure 1: Benchmark composition and curation workflow. We integrate diverse data sources through a three-stage pipeline of human annotation, rule-based filtering, and expert curation (upper left), yielding four VizWiz-based subsets with modality-specific uncertainty labels (Clean, Image, Text, Cross; top right). VQ-FocusAmbiguity is leveraged as a grounded case of cross-modal ambiguity, providing unambiguous vs. ambiguous text-image alignment examples (middle right). Hallucination-focused subsets are built from CLEVR scene graphs using compositional templates and rule-based generation, enabling controlled evaluation of attribute, existence, counting, and relation hallucinations (bottom left).
  • Figure 2: Overview of our pipeline. A clean image–question pair is processed through three stages: uncertainty perturbation, variant generation, and uncertainty calculation. In the perturbation stage (left), we inject visual, textual, and cross-modal uncertainty (e.g., blur, brightness, and occlusion for images; subjective or invalid rewrites, typos, and shuffles for text; AMB- and IVE-based edits for cross-modality). Perturbation intensity is calibrated for the VLMs on small validation subsets of the target datasets so that it is neither too weak (no effect) nor too strong (catastrophic failure); we manually select these levels using a visualization tool, as detailed in Appendix B.4. In the variant-generation stage (middle), each original clean pair is expanded into a set of perturbed counterparts. In the calculation stage (right), all variants are fed into the VLMs, from which we collect intermediate statistics (such as token probabilities, entropy, and sampled generations). Different UQ methods use these statistics to compute uncertainty scores, which are evaluated with AUROC, F1, URR, and HCC to study modality sensitivity and the relationship between uncertainty and hallucinations.
  • Figure 3: HCC($\Delta U$) on CLEVR-Existence across models (LLaVA, Qwen, GPT-4o-mini). Rows list UQ estimators; columns denote modality-specific perturbations (visual in blue, textual in orange). Cells show the point-biserial correlation between hallucination flips and uncertainty change; higher (warmer) is better, and negative values share the zero color. Only GPT-4o-mini displays the colorbar; for this model we omit PMI, PTrue, and MeanTokenEntropy because its closed-source API does not expose full token-level distributions.
  • Figure 4: Calibration Interface (Too Weak). Our interactive tool allows for batch visualization of perturbation intensities. In this example, the selected intensity is insufficient to induce meaningful uncertainty, representing a trivial case where the perturbation has no effect.
  • Figure 5: Calibration Interface (Too Strong). An example of catastrophic semantic destruction. At extreme intensities (e.g., blur scale 19.00), the image content becomes unrecognizable, leading the model to refuse to answer and rendering UQ metrics uninformative.
  • ...and 20 more figures
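
The Figure 3 caption describes HCC($\Delta U$) as the point-biserial correlation between hallucination flips and uncertainty change. The following is a minimal sketch of that statistic only, not the paper's implementation. The function name `point_biserial`, the 0/1 encoding of flips, the definition of the uncertainty change as perturbed minus clean score, and the NaN convention for degenerate inputs are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def point_biserial(flips: Sequence[int], delta_u: Sequence[float]) -> float:
    """Point-biserial correlation between a binary and a continuous variable.

    flips[i]   -- 1 if the perturbation flipped sample i into a hallucination,
                  else 0 (hypothetical encoding, assumed for this sketch).
    delta_u[i] -- change in the uncertainty score for sample i, assumed here
                  to be uncertainty(perturbed) - uncertainty(clean).
    """
    if len(flips) != len(delta_u):
        raise ValueError("inputs must have the same length")

    flipped = [d for f, d in zip(flips, delta_u) if f == 1]
    stable = [d for f, d in zip(flips, delta_u) if f == 0]
    n1, n0, n = len(flipped), len(stable), len(delta_u)
    if n1 == 0 or n0 == 0:
        return float("nan")  # undefined when only one group is present

    spread = pstdev(delta_u)  # population standard deviation over all samples
    if spread == 0.0:
        return float("nan")  # undefined when the uncertainty change is constant

    # Equivalent to the Pearson correlation between flips and delta_u.
    return (mean(flipped) - mean(stable)) / spread * math.sqrt(n1 * n0 / n**2)


if __name__ == "__main__":
    # Toy data: uncertainty rises more on the two samples that flipped.
    print(round(point_biserial([1, 1, 0, 0], [0.4, 0.6, 0.0, 0.2]), 3))  # 0.894
```

A value near 1 means the uncertainty score rises mainly on the samples whose answers flip into hallucinations, which is the behavior the caption labels as better. A value near 0 means the score change carries little information about flips.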