Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume

Gregory Kang Ruey Lau, Hieu Dao, Nicole Kan Hui Lin, Bryan Kian Hsiang Low

TL;DR

UMPIRE is a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models' own internal modality features.

Abstract

Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models' own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE's design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE's generalization to non-text output tasks, including image and audio generation.
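
To make the idea of an incoherence-adjusted semantic volume concrete, here is a minimal sketch. It is not the authors' formula, which is not reproduced in this summary. It assumes that (i) the global term is the log-determinant of a jittered Gram matrix of normalised response embeddings, (ii) each response's incoherence is one minus its sequence probability under the model, and (iii) each embedding is rescaled by exp(alpha * incoherence), which makes the adjustment additive in log space. The names `gram_matrix`, `log_det_spd`, `adjusted_semantic_volume`, and the `jitter` parameter are hypothetical.

```python
import math


def gram_matrix(embeddings):
    """Cosine-similarity Gram matrix of the (non-zero) embedding vectors."""
    unit = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(s)
                log_det += 2.0 * math.log(chol[i][j])
            else:
                chol[i][j] = s / chol[j][j]
    return log_det


def adjusted_semantic_volume(embeddings, seq_probs, alpha=1.0, jitter=1e-3):
    """Assumed score: log det(D (G + jitter * I) D), D = diag(exp(alpha * (1 - p_i))).

    Because det(D A D) = det(D)^2 det(A), the score splits into a volume
    term plus 2 * alpha * sum_i (1 - p_i), so D is never formed explicitly.
    """
    gram = gram_matrix(embeddings)
    for i in range(len(gram)):
        gram[i][i] += jitter  # keeps the matrix positive definite
    volume = log_det_spd(gram)                     # global semantic diversity
    incoherence = sum(1.0 - p for p in seq_probs)  # local incoherence
    return volume + 2.0 * alpha * incoherence


# Three near-identical, confident responses vs. three diverse, unsure ones.
consistent = adjusted_semantic_volume([[1, 0], [1, 0], [1, 0]], [0.9, 0.9, 0.9])
diverse = adjusted_semantic_volume(
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0.3, 0.3, 0.3]
)
```

Under these assumptions, `diverse` is larger than `consistent`, and setting `alpha=0` reduces the score to the plain log-determinant volume.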

Paper Structure

This paper contains 51 sections, 8 theorems, 31 equations, 13 figures, 11 tables, 1 algorithm.

Key Result

Proposition 1.1

For $V_t$ defined in eq:umpire, we have

Figures (13)

  • Figure 1: Schematic describing the UMPIRE framework.
  • Figure 2: Multimodal coherence: Decrease in AUROC and ECE when image-input information is (1) corrupted with noise, (2) replaced with a black image, or (3) removed.
  • Figure 3: Performance of uncertainty metrics in black-box settings across image-text QA datasets. Sem.Ent (D) denotes the discrete version of Sem.Ent; (Llava) denotes the use of Llava as the white-box proxy model.
  • Figure 4: Efficiency analysis of UMPIRE compared to baselines. (a) Computational overhead (inference latency in seconds) versus uncertainty estimation performance (AUROC). UMPIRE (red star) achieves state-of-the-art performance with negligible overhead, avoiding the high computational cost associated with semantic equivalence checks in methods like Sem.Ent. (b) The effect of the number of generated responses $k$ on performance. UMPIRE consistently outperforms other methods across all sample sizes and converges to high accuracy even with few generations (e.g., $k=5$), demonstrating superior sample efficiency.
  • Figure 5: Sensitivity analysis of the hyperparameter $\alpha$ in UMPIRE on the AdVQA dataset. The plots illustrate the effect of the incoherence score weight on Combined Score, AUROC, ECE, and CPC. The value $\alpha=0$ corresponds to the unadjusted semantic volume. The green star ($\star$) marks the optimal $\alpha$ found by maximizing the Combined Score, while the red dot ($\bullet$) indicates the value selected by our adaptive strategy. Notably, the adaptive approach yields performance close to the optimum without requiring labeled data for tuning.
  • ...and 8 more figures
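
Several captions above report AUROC for error detection. As a reminder of what that number measures, the sketch below computes AUROC as the probability that a randomly chosen erroneous response receives a higher uncertainty score than a randomly chosen correct one, with ties counted as one half. The function name `error_detection_auroc` and the toy inputs are hypothetical, and this is the generic pairwise definition rather than the paper's evaluation code.

```python
def error_detection_auroc(uncertainties, is_error):
    """Pairwise AUROC: P(score of an error > score of a correct answer)."""
    errors = [u for u, e in zip(uncertainties, is_error) if e]
    correct = [u for u, e in zip(uncertainties, is_error) if not e]
    if not errors or not correct:
        raise ValueError("need at least one erroneous and one correct response")
    wins = 0.0
    for err_score in errors:
        for ok_score in correct:
            if err_score > ok_score:
                wins += 1.0
            elif err_score == ok_score:
                wins += 0.5  # ties contribute half a win
    return wins / (len(errors) * len(correct))


# Uncertainty perfectly separates errors from correct answers.
perfect = error_detection_auroc([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
# Uncertainty carries no ranking information about errors.
chance = error_detection_auroc([0.9, 0.1, 0.8, 0.2], [True, True, False, False])
```

A value of 1.0 means the metric ranks every error above every correct answer, and 0.5 corresponds to chance-level ranking.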

Theorems & Definitions (16)

  • Proposition 1.1: UMPIRE decomposition
  • proof
  • Lemma 1.2: Second moment form of semantic volume
  • proof
  • Definition 1.3: Quadratic entropy
  • Lemma 1.4: Incoherence term as Monte Carlo estimate of $H_2$
  • proof
  • Lemma 1.5: Coarsening decreases quadratic entropy
  • proof
  • Corollary 1.6: Dominant-mode lower bound
  • ...and 6 more
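
The statements of Definition 1.3 and Lemmas 1.4 and 1.5 are not reproduced in this summary. Assuming the paper uses the standard quadratic (Gini-Simpson) entropy, the following worked equations show why an average of $1 - p(y_i)$ over sampled responses can serve as a Monte Carlo estimate of $H_2$, and why merging outcomes cannot increase it. The symbols $p$, $y_i$, $k$, $a$, and $b$ are notation chosen here for illustration.

```latex
% Quadratic entropy of a distribution p over responses y (assumed definition)
H_2(p) = 1 - \sum_{y} p(y)^2
       = \sum_{y} p(y)\,\bigl(1 - p(y)\bigr)
       = \mathbb{E}_{y \sim p}\bigl[\,1 - p(y)\,\bigr]

% Monte Carlo estimate from k i.i.d. samples y_1, \dots, y_k \sim p (unbiased)
\hat{H}_2 = \frac{1}{k} \sum_{i=1}^{k} \bigl(1 - p(y_i)\bigr),
\qquad \mathbb{E}\bigl[\hat{H}_2\bigr] = H_2(p)

% Coarsening: merging two outcomes with probabilities a and b
(a + b)^2 = a^2 + b^2 + 2ab \;\ge\; a^2 + b^2
\;\Longrightarrow\; H_2(\text{merged}) \le H_2(p)
```

The last line matches the title of Lemma 1.5: grouping responses into coarser classes can only raise the sum of squared probabilities, so quadratic entropy can only fall.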