FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs
Debarpan Bhattacharya, Apoorva Kulkarni, Sriram Ganapathy
TL;DR
FESTA is a black-box uncertainty estimator for multimodal LLMs. It generates functionally equivalent samples (FES) to gauge the model's consistency and functionally complementary samples (FCS) to probe its sensitivity. Uncertainty is computed as the KL distance of the model's observed behavior from that of an ideal consistent model (U_{FES}) and an ideal sensitive model (U_{FCS}), and the two terms are combined to form the FESTA score. Without using ground-truth labels, this score improves selective prediction AUROC on vision and audio multiple-choice question answering (MCQA) tasks. The gains hold across the models and datasets evaluated, the implementation is open source, and the results highlight robustness against low-uncertainty hallucinations. The work lays a foundation for uncertainty-aware abstention in multimodal reasoning and points to extensions toward open-ended generation and generation-quality concerns.
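To make the two-term score concrete, the following is a minimal sketch of one possible reading of U_{FES} and U_{FCS}, not the authors' implementation. It assumes that the ideal consistent model always repeats the original answer on equivalent samples, that the ideal sensitive model always changes its answer on complementary samples, that the KL distance from each ideal reduces to a negative log-probability, and that the two terms are simply summed. The function names, the Laplace smoothing constant `alpha`, and the summation rule are all hypothetical.

```python
import math
from collections import Counter


def answer_probability(answers, target, n_options, alpha=1.0):
    """Laplace-smoothed empirical probability of `target` among sampled answers.

    Smoothing (an assumption of this sketch) keeps the logarithms below finite.
    """
    counts = Counter(answers)
    return (counts[target] + alpha) / (len(answers) + alpha * n_options)


def festa_sketch(original_answer, fes_answers, fcs_answers, n_options):
    """Return (u_fes, u_fcs, combined) under the assumptions stated above.

    fes_answers: model answers on functionally equivalent inputs.
    fcs_answers: model answers on functionally complementary inputs.
    """
    if n_options < 2:
        raise ValueError("need at least two answer options")

    # Consistency: an ideal model keeps the original answer on every
    # equivalent sample. KL(point mass on original || empirical) = -log p.
    p_same_fes = answer_probability(fes_answers, original_answer, n_options)
    u_fes = -math.log(p_same_fes)

    # Sensitivity: an ideal model never keeps the original answer on a
    # complementary sample. KL over the binary event "answer changed".
    p_same_fcs = answer_probability(fcs_answers, original_answer, n_options)
    u_fcs = -math.log(1.0 - p_same_fcs)

    # Hypothetical combination rule: a plain sum of the two terms.
    return u_fes, u_fcs, u_fes + u_fcs


# A model that is consistent on FES and flips its answer on FCS ...
good = festa_sketch("A", ["A"] * 10, ["B"] * 10, n_options=4)
# ... versus one that wavers on FES and ignores the change in FCS.
bad = festa_sketch("A", ["A"] * 5 + ["B"] * 5, ["A"] * 10, n_options=4)
```

Under these assumptions the second model receives the larger combined score, since it is penalized both for inconsistency and for insensitivity.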
Abstract
Accurately assessing trust in the predictions generated by multimodal large language models (MLLMs), which can enable selective prediction and improve user confidence, is challenging due to the diverse multimodal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs that generates an uncertainty measure based on equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access to the model (black-box) and does not require ground truth (unsupervised). Experiments are conducted with various off-the-shelf multimodal LLMs on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves a significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, measured by the area under the receiver operating characteristic curve (AUROC) for detecting mispredictions. The code implementation is open-sourced.
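The evaluation metric can be illustrated with a short sketch. AUROC for misprediction detection is the probability that a randomly chosen wrong prediction receives a higher uncertainty score than a randomly chosen correct one, with ties counted as one half. The function name `misprediction_auroc` and the toy scores below are hypothetical and are not taken from the paper's experiments.

```python
def misprediction_auroc(uncertainties, is_wrong):
    """AUROC of an uncertainty score used to detect mispredictions.

    uncertainties: one score per prediction (higher = less trusted).
    is_wrong: one boolean per prediction, True if the prediction is wrong.
    Uses the pairwise (Mann-Whitney) definition of AUROC.
    """
    wrong = [u for u, w in zip(uncertainties, is_wrong) if w]
    right = [u for u, w in zip(uncertainties, is_wrong) if not w]
    if not wrong or not right:
        raise ValueError("need at least one wrong and one correct prediction")

    wins = 0.0
    for u_wrong in wrong:
        for u_right in right:
            if u_wrong > u_right:
                wins += 1.0   # wrong prediction ranked as more uncertain
            elif u_wrong == u_right:
                wins += 0.5   # ties count as half
    return wins / (len(wrong) * len(right))


# Toy example: uncertainty separates wrong from correct predictions perfectly.
perfect = misprediction_auroc([0.1, 0.2, 0.9, 0.8], [False, False, True, True])
# Toy example: uncertainty carries no ranking information.
chance = misprediction_auroc([0.1, 0.9, 0.2, 0.8], [False, False, True, True])
```

A score of 1.0 means every misprediction is ranked above every correct prediction, so abstaining on the most uncertain inputs removes errors first; 0.5 corresponds to chance-level ranking.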
