Don't Always Pick the Highest-Performing Model: An Information Theoretic View of LLM Ensemble Selection
Yigit Turkmen, Baturalp Buyukates, Melih Bastopcu
TL;DR
This work addresses budgeted ensembling of large language models (LLMs) under correlated errors by adopting an information-theoretic lens. It introduces a Gaussian-copula latent-error model to capture inter-model dependencies and proves that, unlike the independent-error case, simply selecting the most accurate models is not always optimal. A greedy mutual-information (Greedy MI) algorithm is proposed to select a size-$k$ subset that maximizes information about the true label, accounting for both relevance and structured error correlations. The authors demonstrate, across MEDMCQA, MMLU, and IMDB, that Greedy MI consistently outperforms strong baselines under the same query budget, with theoretical insights into when and why ensemble gains saturate under correlation. The work provides a principled framework for robust, cost-aware LLM ensembling and highlights practical limits and directions for future diversity- and model-selection strategies in correlated ensembles.
Abstract
Large language models (LLMs) are often ensembled to improve reliability and robustness, but in practice the models' errors are strongly correlated. This raises a fundamental question: which models should be selected when forming an LLM ensemble? We formulate budgeted ensemble selection as maximizing the mutual information between the true label and the predictions of the selected models. Furthermore, to explain why performance can saturate even with many models, we model the correlated errors of the models using a Gaussian copula and show an information-theoretic error floor for the performance of the ensemble. Motivated by these results, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data and iteratively builds an ensemble under a query budget. We test our approach on two question-answering datasets and one binary sentiment classification dataset: MEDMCQA, MMLU, and IMDB movie reviews. Across all datasets, we observe that our method consistently outperforms strong baselines under the same query budget.
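To make the selection rule concrete, the following is a minimal sketch of greedy mutual-information selection on a labeled validation set. It is an illustration under stated assumptions, not the authors' implementation: the function names `empirical_mi` and `greedy_mi_select` are hypothetical, the information terms are estimated with a simple plug-in (count-based) estimator over joint prediction tuples, and the toy data are constructed by hand. The estimator actually used in the paper may differ.

```python
from collections import Counter
from math import log2


def empirical_mi(labels, joint_preds):
    """Plug-in estimate of I(Y; X) in bits.

    labels: list of true labels.
    joint_preds: list of tuples, one tuple of model predictions per example.
    (Assumption: a count-based estimator, which is biased for small samples.)
    """
    n = len(labels)
    count_y = Counter(labels)
    count_x = Counter(joint_preds)
    count_xy = Counter(zip(labels, joint_preds))
    mi = 0.0
    for (y, x), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
        mi += (c / n) * log2(c * n / (count_y[y] * count_x[x]))
    return mi


def greedy_mi_select(labels, preds, k):
    """Greedily pick k models whose joint predictions are most informative about Y.

    preds: dict mapping a (hypothetical) model name to its list of predictions.
    """
    selected = []
    remaining = sorted(preds)  # deterministic tie-breaking by name
    for _ in range(min(k, len(remaining))):
        best_name, best_score = None, -1.0
        for name in remaining:
            columns = [preds[m] for m in selected + [name]]
            joint = list(zip(*columns))  # one prediction tuple per example
            score = empirical_mi(labels, joint)
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        remaining.remove(best_name)
    return selected


# Toy data: A and B are 75% accurate but make identical errors;
# C is only 62.5% accurate but errs on different examples.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
preds = {
    "A": [0, 0, 0, 1, 1, 1, 1, 0],
    "B": [0, 0, 0, 1, 1, 1, 1, 0],
    "C": [1, 1, 0, 0, 0, 1, 1, 1],
}
chosen = greedy_mi_select(labels, preds, k=2)
print(chosen)
```

On this toy data the sketch selects A and then C rather than the two most accurate models, A and B, because B duplicates A's errors and adds no information about the label. Note that the number of distinct joint prediction tuples grows exponentially with the ensemble size, so a plug-in estimate of this kind is only reasonable for small $k$ or large validation sets.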
