FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs

Debarpan Bhattacharya, Apoorva Kulkarni, Sriram Ganapathy

TL;DR

FESTA introduces a black-box uncertainty estimator for multimodal LLMs that generates functionally equivalent samples (FES) to gauge consistency and functionally complementary samples (FCS) to probe sensitivity. The method scores uncertainty as the KL divergence of the model's predictions from an ideally consistent model (on FES) and from an ideally sensitive model (on FCS), and combines U_{FES} and U_{FCS} into the FESTA score, which improves selective prediction AUROC on vision and audio MCQA tasks without ground-truth labels. It performs strongly across models and datasets, comes with an open-source implementation, and is robust to low-uncertainty hallucinations. The work lays a foundation for uncertainty-aware abstention in multimodal reasoning and points to extensions toward open-ended generation and generation-quality concerns.

Abstract

Accurately assessing trust in the predictions generated by multimodal large language models (MLLMs), which can enable selective prediction and improve user confidence, is challenging because of the diverse multimodal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs that generates an uncertainty measure from equivalent and complementary input samples. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access to the model (black-box) and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multimodal LLMs on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves a significant improvement in selective prediction performance (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs), measured by the area under the receiver operating characteristic curve (AUROC) for detecting mispredictions. The code implementation is open-sourced.
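
To make the evaluation metric concrete, the following is a minimal sketch of how an AUROC for detecting mispredictions can be computed from per-question uncertainty scores. It is an illustration, not the authors' evaluation code: the function name `misprediction_auroc` and the toy inputs are hypothetical, and the pairwise (Mann-Whitney) formulation is one standard way to compute AUROC.

```python
def misprediction_auroc(uncertainties, is_wrong):
    """AUROC of an uncertainty score used as a detector of wrong predictions.

    Equals the probability that a randomly chosen wrong prediction receives a
    higher uncertainty than a randomly chosen correct one (ties count 0.5).
    """
    wrong = [u for u, w in zip(uncertainties, is_wrong) if w]
    right = [u for u, w in zip(uncertainties, is_wrong) if not w]
    if not wrong or not right:
        raise ValueError("need at least one wrong and one correct prediction")

    wins = 0.0
    for u_wrong in wrong:
        for u_right in right:
            if u_wrong > u_right:
                wins += 1.0
            elif u_wrong == u_right:
                wins += 0.5
    return wins / (len(wrong) * len(right))


# Toy example (made-up numbers): both wrong answers carry the highest uncertainty.
scores = [0.9, 0.1, 0.8, 0.3]
wrong_flags = [True, False, True, False]
auroc = misprediction_auroc(scores, wrong_flags)
```

An AUROC of 1.0 means every misprediction is ranked as more uncertain than every correct prediction, while 0.5 corresponds to an uninformative score. Ground-truth labels are needed only to evaluate the uncertainty score this way, not to compute it.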

Paper Structure

This paper contains 42 sections, 6 theorems, 30 equations, 5 figures, 11 tables, 1 algorithm.

Key Result

Proposition 3.1

The $U_{FES}(M|{\mathbf x})$ simplifies to:

Figures (5)

  • Figure 1: An example of multi-modal reasoning input (top-panel). An equivalent sample (middle panel) with same gray-scale image and rephrased prompt question is expected to keep the MLLM prediction unchanged, whereas a complementary input sample (bottom panel) is expected to alter the prediction. The proposed FESTA uses equivalent and complementary samples to generate the uncertainty measure.
  • Figure 2: Schematic illustration of the proposed FESTA uncertainty quantification approach. Given a multimodal MCQ input, we generate functionally equivalent samples (FES) and functionally complementary samples (FCS). We compute the divergence of the model's predictive distribution from an ideally consistent model (for FES) and an ideally sensitive model (for FCS), and then combine these measures to generate the FESTA uncertainty score.
  • Figure 5: FESTA log(score) plots for best improvement models where score is reciprocal of FESTA uncertainty.
  • Figure 6: FESTA log(score) plots for output sampling baseline where score is reciprocal of FESTA uncertainty.
  • Figure 7: Examples of Functionally Equivalent Transform and Functionally Complementary Transform for both audio-text and image-text questions.
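
The pipeline described in the Figure 2 caption can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the paper's exact formulation: the closed forms of U_{FES} and U_{FCS} are not reproduced on this page, so the sketch assumes that the ideally consistent model puts all probability on the original answer (giving a KL term of -log p), that the ideally sensitive model puts none on it (giving -log(1 - p)), and that the two terms are combined by a plain sum. The helper names, the add-one smoothing, and the toy answers are all hypothetical.

```python
import math
from collections import Counter


def answer_prob(answers, target, num_options, alpha=1.0):
    """Smoothed empirical probability that the model answers `target`.

    Additive smoothing (an assumption of this sketch) keeps the probability
    strictly between 0 and 1 so both logarithms below stay finite.
    """
    counts = Counter(answers)
    return (counts[target] + alpha) / (len(answers) + alpha * num_options)


def festa_like_uncertainty(original_answer, fes_answers, fcs_answers, num_options):
    """Return (u_fes, u_fcs, combined) for one multiple-choice question."""
    p_fes = answer_prob(fes_answers, original_answer, num_options)
    p_fcs = answer_prob(fcs_answers, original_answer, num_options)
    # Assumed consistency term: KL from a one-hot distribution on the original answer.
    u_fes = -math.log(p_fes)
    # Assumed sensitivity term: penalise keeping the original answer on complementary inputs.
    u_fcs = -math.log(1.0 - p_fcs)
    # Assumed combination rule: a simple sum of the two terms.
    return u_fes, u_fcs, u_fes + u_fcs


# A model that keeps its answer on equivalent inputs and changes it on complementary ones.
good = festa_like_uncertainty("A", ["A"] * 4, ["B"] * 4, num_options=4)
# A model that flips on equivalent inputs and ignores complementary changes.
bad = festa_like_uncertainty("A", ["B", "C", "B", "D"], ["A"] * 4, num_options=4)
```

In this toy setup the second model receives the larger combined score, matching the intuition that inconsistency on equivalent samples and insensitivity to complementary samples both signal an untrustworthy prediction. Only model outputs are used, so the sketch is black-box and needs no ground-truth labels.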

Theorems & Definitions (11)

  • Proposition 3.1
  • proof
  • Proposition 3.2
  • proof
  • Proposition A.1
  • proof
  • Proposition A.2
  • proof
  • Proposition A.3
  • proof
  • ...and 1 more