Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems
Aakriti Agrawal, Rohith Aralikatti, Anirudh Satheesh, Souradip Chakraborty, Amrit Singh Bedi, Furong Huang
TL;DR
This work tackles the problem of selecting the best answer across multiple heterogeneous LLMs without relying on costly external verifiers or extensive sampling. It introduces a calibrated log-likelihood framework that aggregates per-token probabilities across models, using a single teacher-forced forward pass to compute a model-agnostic score and thereby identify the most confident, and likely correct, response. A generalization that incorporates any uncertainty measure via $M_c(C\mid p) = \frac{1}{N} \sum_{i=1}^N M_i(C\mid p)$ makes the method flexible across uncertainty signals. Empirically, the approach yields notable improvements on the GSM8K, MMLU, and ARC datasets in both debate and best-of-N settings, outperforming random tie-breaking and matching or surpassing single-model baselines under equal call counts. Overall, the calibrated, ensemble-based scoring enables efficient and reliable multi-LLM reasoning by leveraging model diversity without external judges or heavy sampling.
Abstract
Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform single-LLM self-consistency. We propose a principled, novel, and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score, implicitly leveraging the inherent knowledge and confidence of these models. Our method demonstrates improvements of approximately 4%, 3%, and 5% across both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings on the GSM8K, MMLU (6 subsets), and ARC datasets, respectively.
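To make the aggregation $M_c(C\mid p) = \frac{1}{N} \sum_{i=1}^N M_i(C\mid p)$ concrete, the following is a minimal sketch, not the authors' implementation. It assumes each of the $N$ models has already scored every candidate answer in one teacher-forced pass and returned per-token log-probabilities. The function names, the input layout, and the use of a length-normalized mean log-probability as $M_i$ are all illustrative assumptions; the paper's exact calibration is not reproduced here.

```python
from statistics import fmean


def sequence_score(token_logprobs):
    """M_i(C | p) for one model and one candidate.

    Assumption: the score is the mean per-token log-probability
    (length-normalized log-likelihood). Any other uncertainty
    measure could be substituted here.
    """
    return fmean(token_logprobs)


def aggregate_scores(per_model_logprobs):
    """Compute M_c(C_j | p) = (1/N) * sum_i M_i(C_j | p) for each candidate j.

    per_model_logprobs[i][j] is the list of token log-probs that model i
    assigns to candidate j (hypothetical input layout).
    """
    n_models = len(per_model_logprobs)
    n_candidates = len(per_model_logprobs[0])
    return [
        sum(sequence_score(per_model_logprobs[i][j]) for i in range(n_models))
        / n_models
        for j in range(n_candidates)
    ]


def select_answer(candidates, per_model_logprobs):
    """Return the candidate with the highest aggregated score, plus all scores."""
    scores = aggregate_scores(per_model_logprobs)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores
```

In this sketch, a candidate that one model rates highly but the others rate poorly is pulled down by the average, so the selected answer is the one the ensemble as a whole finds most likely.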
