SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems

Chung-En Johnny Yu; Brian Jalaian; Nathaniel D. Bastian

Abstract

Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models' outputs amplifies uncertainty and increases the risk of hallucinations. We propose SCoOP (Semantic-Consistent Opinion Pooling), a training-free uncertainty quantification (UQ) framework for multi-VLM systems based on uncertainty-weighted linear opinion pooling. The core idea is to treat each VLM as a probabilistic "expert": sample multiple outputs from it, map them to a unified space, aggregate the experts' opinions, and produce a system-level uncertainty score. Unlike prior UQ methods designed for single models, SCoOP explicitly measures collective, system-level uncertainty across multiple VLMs, enabling effective hallucination detection and abstention on highly uncertain samples. On ScienceQA, SCoOP achieves an AUROC of 0.866 for hallucination detection, outperforming baselines (0.732-0.757) by approximately 10-13 percentage points. For abstention, it attains an AURAC of 0.907, exceeding baselines (0.818-0.840) by 7-9 percentage points. These gains come with only microsecond-level aggregation overhead relative to the baselines, which is negligible compared to typical VLM inference time (on the order of seconds). These results demonstrate that SCoOP provides an efficient and principled mechanism for uncertainty-aware aggregation, advancing the reliability of multimodal AI systems. Our code is publicly available at https://github.com/chungenyu6/SCoOP.

Paper Structure (15 sections, 10 equations, 7 figures, 6 tables, 1 algorithm)

Introduction
Related Work
SCoOP Framework
Single-VLM's Probabilistic Opinions
Multi-VLM Opinions Aggregation
Experiment
Experiment Setup
Result Analysis
Research Questions and Ablation Study
Conclusion
Appendix
Additional Math Details of SCoOP
The Aggregation Baselines
Additional Experiments Details
Prompt Template

Figures (7)

Figure 1: Overview of SCoOP increasing the reliability of a multi-VLM system.
Figure 1: Comparison of E2E-Latency@p50 across varying system sizes. $L_{sec}$ denotes latency per sample (in seconds). $\Delta L_{\mu s}$ denotes the latency difference relative to SCoOP (in microseconds). Negative values indicate a method is faster than SCoOP.
Figure 2: SCoOP (Semantic-Consistent Opinion Pooling) workflow. For each VLM $M_k$, we sample $N$ responses, map them to the unified option set $\Theta$, and form a probability vector $\mathbf{p}_k$. Each model's uncertainty weight $w_k$ is calculated from its Shannon entropy. We then apply weighted linear opinion pooling $\mathbf{p}_{\text{agg}}=\sum_{k=1}^K w_k \mathbf{p}_k$ to obtain the system distribution, from which we derive the final response $\theta^*$ and the system-level uncertainty $H_{\text{agg}}$. This unifies the outputs of heterogeneous VLMs and quantifies the uncertainty of the multi-VLM system.
Figure 3: Hallucination detection (AUROC) and abstention (AURAC) performance on ScienceQA with 3-VLM systems of extra-large-parameter models. Higher values indicate better uncertainty quality and impact. The green dashed line marks the average AUROC/AURAC of the three individual VLMs evaluated separately (without aggregation).
Figure 4: UQ performance across varying model parameter scales. Methods are evaluated in 3-VLM systems on ScienceQA. AUROC for hallucination detection; AURAC for abstention.
...and 2 more figures
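
The workflow described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`opinion`, `entropy`, `pool`) are hypothetical, the sampled responses are assumed to be already mapped to labels in the option set $\Theta$, and the weighting rule $w_k \propto \exp(-H_k)$ is an assumption chosen only so that lower-entropy models receive larger weights. The caption states that weights are derived from Shannon entropy but does not give the exact formula.

```python
import math
from collections import Counter


def opinion(samples, options):
    """Empirical probability vector p_k over the unified option set."""
    # Assumes every sample is already mapped to one of `options`.
    counts = Counter(samples)
    n = len(samples)
    return [counts[o] / n for o in options]


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0)


def pool(opinions):
    """Weighted linear opinion pooling: p_agg = sum_k w_k * p_k."""
    # ASSUMPTION: w_k proportional to exp(-H_k). The paper's exact
    # entropy-based weighting is not shown in this excerpt.
    raw = [math.exp(-entropy(p)) for p in opinions]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(opinions[0])
    p_agg = [sum(w * p[i] for w, p in zip(weights, opinions)) for i in range(dim)]
    return weights, p_agg


# Toy example: K = 3 models, N = 4 sampled responses each, 3 options.
options = ["A", "B", "C"]
samples_per_model = [
    ["A", "A", "A", "A"],  # fully consistent model, entropy 0
    ["A", "B", "A", "C"],  # most scattered model
    ["B", "B", "A", "A"],  # split between two options
]

opinions = [opinion(s, options) for s in samples_per_model]
weights, p_agg = pool(opinions)

# Final response theta* and system-level uncertainty H_agg.
answer = options[max(range(len(options)), key=p_agg.__getitem__)]
h_agg = entropy(p_agg)

print("weights:", weights)
print("p_agg:", p_agg)
print("answer:", answer, "H_agg:", h_agg)
```

In this toy setup the fully consistent model receives the largest weight, and a system could abstain whenever `h_agg` exceeds a chosen threshold; the threshold itself would need to be tuned on held-out data.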
