The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs

Asif Azad, Mohammad Sadat Hossain, MD Sadik Hossain Shanto, M Saifur Rahman, Md Rizwan Parvez

TL;DR

This work tackles uncertainty quantification in Vision-Language Models (VLMs) by applying conformal prediction to produce calibrated prediction sets with formal guarantees across six multimodal benchmarks. It systematically compares three scoring functions—Least Ambiguous Classifier (LAC), Adaptive Prediction Sets (APS), and Marginal Score (MS)—and introduces instruction-guided likelihood proxies to extend principled uncertainty evaluation to closed-source models. Key findings show that larger models offer better accuracy and tighter, more reliable uncertainty estimates, with uncertainty patterns varying by domain (e.g., ScienceQA vs MathVision) and by task complexity; task-adaptive scoring improves efficiency. The study establishes a practical, distribution-free framework for trustworthy multimodal AI, enabling safer deployment in high-stakes settings and guiding future research into adaptive calibration and closed-model uncertainty proxies.

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. To address this gap, and unlike prior conformal prediction studies that focused on limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 18 state-of-the-art VLMs (open and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. For closed-source models that do not expose token-level log-probabilities, we develop and validate instruction-guided likelihood proxies. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification; models that know more also know better what they don't know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models than other domains do. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
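
As a rough illustration of the conformal procedure described above, the following minimal sketch builds split-conformal prediction sets with two of the named scoring functions. It uses the usual textbook definitions of LAC (one minus the probability of the label) and a non-randomized APS (cumulative probability mass down to the label); whether the paper uses exactly these variants is an assumption, and MS is omitted because its definition is not given here. The function names, the four-option setup, and the synthetic Dirichlet probabilities are all hypothetical stand-ins for real VLM option probabilities. The miscoverage level alpha = 0.1 corresponds to the 90% coverage target mentioned in the Figure 5 caption.

```python
import numpy as np

def lac_scores(probs):
    """LAC nonconformity: 1 minus the probability of each candidate label."""
    return 1.0 - probs

def aps_scores(probs):
    """Non-randomized APS: mass of all labels ranked at or above each label."""
    order = np.argsort(-probs, axis=1)                      # most likely first
    cum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    scores = np.empty_like(probs)
    np.put_along_axis(scores, order, cum, axis=1)           # back to label order
    return scores

def conformal_threshold(true_scores, alpha):
    """k-th smallest calibration score with k = ceil((n + 1) * (1 - alpha))."""
    n = len(true_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:                                               # too few calibration points
        return np.inf
    return np.sort(true_scores)[k - 1]

def prediction_sets(score_fn, cal_probs, cal_labels, test_probs, alpha=0.1):
    """Boolean matrix: entry (i, y) is True when label y is in the set for test item i."""
    cal_scores = score_fn(cal_probs)[np.arange(len(cal_labels)), cal_labels]
    qhat = conformal_threshold(cal_scores, alpha)
    return score_fn(test_probs) <= qhat

def sample_labels(probs, rng):
    """Draw one label per row from its probability vector (synthetic ground truth)."""
    u = rng.random(len(probs))[:, None]
    idx = (u > np.cumsum(probs, axis=1)).sum(axis=1)
    return np.minimum(idx, probs.shape[1] - 1)              # guard against rounding

# Synthetic stand-in for per-option probabilities on a 4-option benchmark.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4), size=2000)
test_probs = rng.dirichlet(np.ones(4), size=2000)
cal_labels = sample_labels(cal_probs, rng)
test_labels = sample_labels(test_probs, rng)

results = {}
for name, fn in [("LAC", lac_scores), ("APS", aps_scores)]:
    sets = prediction_sets(fn, cal_probs, cal_labels, test_probs, alpha=0.1)
    coverage = sets[np.arange(len(test_labels)), test_labels].mean()
    results[name] = (coverage, sets.sum(axis=1).mean())
    print(f"{name}: coverage={coverage:.3f}, mean set size={results[name][1]:.2f}")
```

The two reported quantities correspond to the metrics the figures refer to: empirical coverage (how often the true answer falls inside the set) and average set size (smaller sets at the same coverage indicate a more certain model).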

Paper Structure

This paper contains 36 sections, 7 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Correlation between accuracy and set size across datasets. Higher-performing models produce more concentrated prediction sets.
  • Figure 2: Comparison of set sizes across VLMs and scoring functions. LAC scoring consistently produces the most compact prediction sets.
  • Figure 3: Comparative uncertainty profiles across all VLMs. Proprietary models like GPT-4o-mini achieve remarkably well-calibrated uncertainty estimates.
  • Figure 4: Relationship between model size, accuracy, and set size. Larger models exhibit both higher accuracy and smaller set sizes.
  • Figure 5: Uncertainty profiles for three model families: InternVL (1B, 2B, 8B), Qwen-VL (3B, 72B), and Gemma (4B, 12B, 27B). Smaller enclosed radar areas indicate better-calibrated uncertainty. Darker shades represent larger models within each family. Each family exhibits distinct scaling patterns across domains. Metrics include Accuracy (Acc.), Coverage for LAC/MS/APS (higher values closer to 90% are better), and Inverted Set Size for LAC/MS/APS (smaller set sizes are better, inverted for visualization).
  • ...and 4 more figures