Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction
Yuanchang Ye, Weiyan Wen
TL;DR
The paper tackles hallucination in vision-language VQA by applying Split Conformal Prediction to produce uncertainty-aware prediction sets with a rigorous marginal coverage guarantee of at least $1-α$ for a user-defined risk level $α$. It defines a nonconformity score and computes prediction sets by thresholding at the calibration-derived quantile $τ = Q_{1-α}$, ensuring distribution-free validity under exchangeability. The approach is model-agnostic and retraining-free, and is demonstrated on MMMU and ScienceQA across eight LVLMs, with prediction sets shrinking as $α$ increases and performance remaining stable across calibration-to-test split ratios. This yields a scalable mechanism for mitigating hallucinations in safety-critical multimodal AI deployments, enabling uncertainty-aware decision-making without distributional assumptions.
Abstract
This study addresses the critical challenge of hallucination mitigation in Large Vision-Language Models (LVLMs) for Visual Question Answering (VQA) tasks through a Split Conformal Prediction (SCP) framework. While LVLMs excel in multi-modal reasoning, their outputs often contain hallucinated content delivered with high confidence, posing risks in safety-critical applications. We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification. By partitioning data into calibration and test sets, the framework computes nonconformity scores to construct prediction sets with statistical guarantees under user-defined risk levels ($α$). Key innovations include: (1) rigorous control of marginal coverage, so that empirical error rates do not exceed $α$; (2) dynamic adjustment of prediction set sizes inversely with $α$, filtering low-confidence outputs; (3) elimination of prior distribution assumptions and retraining requirements. Evaluations on benchmarks (ScienceQA, MMMU) with eight LVLMs demonstrate that SCP enforces theoretical guarantees across all $α$ values. The framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains. This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
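As a rough illustration of the calibrate-then-threshold procedure described above, the following minimal sketch assumes a multiple-choice VQA setting in which the model returns one probability per answer option. The nonconformity score used here (one minus the probability of the true option), the function names, and the finite-sample rank $\lceil (n+1)(1-α) \rceil$ are assumptions made for illustration, not definitions taken from the paper.

```python
import math

import numpy as np


def calibrate_threshold(cal_probs, cal_labels, alpha):
    """Return the conformal threshold tau from a held-out calibration split.

    cal_probs: (n, K) array of per-option probabilities (assumed model output).
    cal_labels: length-n array of true option indices.
    alpha: user-defined risk level in (0, 1).
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    # Assumed nonconformity score: 1 - probability of the true option.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected rank of the (1 - alpha) quantile.
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every option.
        return float("inf")
    return float(np.sort(scores)[k - 1])


def prediction_set(probs, tau):
    """Return indices of all options whose score does not exceed tau."""
    probs = np.asarray(probs, dtype=float)
    return [int(i) for i in np.flatnonzero(1.0 - probs <= tau)]


# Toy calibration split: 4 questions, 3 answer options each.
cal_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
cal_labels = [0, 1, 2, 1]

tau = calibrate_threshold(cal_probs, cal_labels, alpha=0.5)
answer_set = prediction_set([0.5, 0.45, 0.05], tau)
print(tau, answer_set)
```

Under exchangeability of calibration and test examples, sets built this way contain the true option with probability at least $1-α$. A smaller $α$ raises the threshold and therefore enlarges the sets, which matches the inverse relation between set size and $α$ stated above.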
