Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume

Gregory Kang Ruey Lau; Hieu Dao; Nicole Kan Hui Lin; Bryan Kian Hsiang Low

TL;DR

This paper introduces UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across input and output modalities without external tools, relying only on the model's own internal modality features.

Abstract

Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models' own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE's design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE's generalization to non-text output tasks, including image and audio generation.
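
To make the idea of an "incoherence-adjusted semantic volume" concrete, here is a minimal sketch. It is not the paper's formula, which this page does not reproduce. It assumes a quality-diversity kernel of the form diag(q) G diag(q), where G is the cosine-similarity Gram matrix of the sampled responses' embeddings and q_i = exp(alpha * (1 - p_i)) grows as the model's own probability p_i for response i shrinks. The names `gram`, `log_det`, `adjusted_volume`, and the `jitter` regulariser are hypothetical.

```python
import math

def gram(embeddings):
    """Cosine-similarity Gram matrix of the response embeddings."""
    unit = []
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                chol[i][j] = (matrix[i][j] - s) / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))

def adjusted_volume(embeddings, probs, alpha=1.0, jitter=1e-6):
    """Assumed score: log det(diag(q) (G + jitter*I) diag(q)),
    with q_i = exp(alpha * (1 - p_i)). Higher means more uncertain."""
    g = gram(embeddings)
    n = len(g)
    for i in range(n):
        g[i][i] += jitter  # keeps the matrix positive definite for near-duplicates
    q = [math.exp(alpha * (1.0 - p)) for p in probs]
    kernel = [[q[i] * g[i][j] * q[j] for j in range(n)] for i in range(n)]
    return log_det(kernel)
```

Under this assumed form the score splits into a global term and a local term: log det(G + jitter*I) measures how spread out the samples are semantically, and 2 * alpha * sum(1 - p_i) adds a penalty for low-confidence responses. Setting `alpha=0` recovers the unadjusted volume.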

Paper Structure (51 sections, 8 theorems, 31 equations, 13 figures, 11 tables, 1 algorithm)

Introduction
Problem formulation and desiderata
Method
Quality-diversity kernel and semantic volume
Analysis and practical considerations
Experimental results
\ref{r:classify}: Discrimination
\ref{r:r2a}, \ref{r:r2b}: Risk-score quality
\ref{r:generalizability}, \ref{r:coherence}, \ref{r:efficiency}: Design desiderata
Practical applications
Related Works
Conclusion
Theoretical analysis and intuition
Notation and problem setup
UMPIRE metric decomposition.
...and 36 more sections

Key Result

Proposition 1.1

For $V_t$ defined in eq:umpire, we have

Figures (13)

Figure 1: Schematic describing the UMPIRE framework.
Figure 2: Multimodal coherence (\ref{r:coherence}): Decrease in AUROC and ECE when image-input information is (1) corrupted with noise, (2) replaced with a black image, or (3) removed.
Figure 3: Performance of uncertainty metrics in blackbox settings across image-text QA datasets. Sem.Ent (D) indicates its discrete version, (Llava) indicates Llava as the white-box proxy model.
Figure 4: Efficiency analysis of UMPIRE compared to baselines. (a) Computational overhead (inference latency in seconds) versus uncertainty estimation performance (AUROC). UMPIRE (red star) achieves state-of-the-art performance with negligible overhead, avoiding the high computational cost associated with semantic equivalence checks in methods like Sem.Ent. (b) The effect of the number of generated responses $k$ on performance. UMPIRE consistently outperforms other methods across all sample sizes and converges to high accuracy even with few generations (e.g., $k=5$), demonstrating superior sample efficiency.
Figure 5: Sensitivity analysis of the hyperparameter $\alpha$ in UMPIRE on the AdVQA dataset. The plots illustrate the effect of the incoherence score weight on Combined Score, AUROC, ECE, and CPC. The value $\alpha=0$ corresponds to the unadjusted semantic volume. The green star ($\star$) marks the optimal $\alpha$ found by maximizing the Combined Score, while the red dot ($\bullet$) indicates the value selected by our adaptive strategy. Notably, the adaptive approach yields performance close to the optimum without requiring labeled data for tuning.
...and 8 more figures

Theorems & Definitions (16)

Proposition 1.1: UMPIRE decomposition
proof
Lemma 1.2: Second moment form of semantic volume
proof
Definition 1.3: Quadratic entropy
Lemma 1.4: Incoherence term as Monte Carlo estimate of $H_2$
proof
Lemma 1.5: Coarsening decreases quadratic entropy
proof
Corollary 1.6: Dominant-mode lower bound
...and 6 more
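
The entries on quadratic entropy can be illustrated with a short sketch. It assumes the standard form H_2(p) = 1 - sum_y p(y)^2, since this page lists the definition by name only. Under that form H_2 equals the expectation of 1 - p(y) for y drawn from p, so averaging 1 - p(y_i) over sampled outcomes gives a Monte Carlo estimate. The function names and the toy distribution are hypothetical.

```python
import random

def quadratic_entropy(p):
    """Exact H_2(p) = 1 - sum of squared probabilities (assumed definition)."""
    return 1.0 - sum(x * x for x in p)

def mc_quadratic_entropy(p, num_samples, seed=0):
    """Monte Carlo estimate: mean of (1 - p(y)) over outcomes y sampled from p."""
    rng = random.Random(seed)
    outcomes = rng.choices(range(len(p)), weights=p, k=num_samples)
    return sum(1.0 - p[y] for y in outcomes) / num_samples

p = [0.7, 0.2, 0.1]
exact = quadratic_entropy(p)                 # 1 - (0.49 + 0.04 + 0.01)
estimate = mc_quadratic_entropy(p, 20000)    # approaches `exact` as samples grow
coarse = quadratic_entropy([0.7 + 0.2, 0.1]) # merging two outcomes
```

Merging outcomes can only raise the sum of squared probabilities, because (a + b)^2 >= a^2 + b^2 for non-negative a and b, so `coarse` is never larger than `exact`. This matches the direction stated in the title of Lemma 1.5.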
