Don't Always Pick the Highest-Performing Model: An Information Theoretic View of LLM Ensemble Selection

Yigit Turkmen, Baturalp Buyukates, Melih Bastopcu

TL;DR

This work addresses budgeted ensembling of large language models (LLMs) under correlated errors by adopting an information-theoretic lens. It introduces a Gaussian-copula latent-error model to capture inter-model dependencies and proves that, unlike the independent-error case, simply selecting the most accurate models is not always optimal. A greedy mutual-information (Greedy MI) algorithm is proposed to select a size-$k$ subset that maximizes information about the true label, accounting for both relevance and structured error correlations. The authors demonstrate, across MEDMCQA, MMLU, and IMDB, that Greedy MI consistently outperforms strong baselines under the same query budget, with theoretical insights into when and why ensemble gains saturate under correlation. The work provides a principled framework for robust, cost-aware LLM ensembling and highlights practical limits and directions for future diversity- and model-selection strategies in correlated ensembles.
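
The latent-error idea can be illustrated with a small simulation. The following is a minimal sketch, not the authors' model: it assumes a one-factor (equicorrelated) Gaussian copula, and the function names, the correlation `rho`, and the error rates are all hypothetical. Each model errs when its latent Gaussian falls below a threshold chosen to match its marginal error rate, so a shared latent factor makes errors co-occur more often than independence would predict.

```python
import random
from statistics import NormalDist


def sample_copula_errors(error_rates, rho, n_samples, seed=0):
    """Draw correlated binary error indicators from a one-factor Gaussian copula.

    error_rates: marginal error probability of each model (hypothetical values).
    rho: shared pairwise latent correlation in [0, 1) (an assumption of this sketch).
    Returns a list of n_samples tuples; entry j is 1 if model j errs.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Model j errs when its latent Gaussian falls below this threshold,
    # which keeps its marginal error rate equal to error_rates[j].
    thresholds = [std_normal.inv_cdf(e) for e in error_rates]
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    samples = []
    for _ in range(n_samples):
        shared = rng.gauss(0.0, 1.0)  # latent factor common to all models
        row = tuple(
            1 if a * shared + b * rng.gauss(0.0, 1.0) < t else 0
            for t in thresholds
        )
        samples.append(row)
    return samples


def joint_error_rate(samples, i, j):
    """Empirical estimate of P(E_i and E_j)."""
    return sum(r[i] * r[j] for r in samples) / len(samples)


errors = sample_copula_errors([0.2, 0.25, 0.3], rho=0.6, n_samples=20000)
marginals = [sum(r[j] for r in errors) / len(errors) for j in range(3)]
both_01 = joint_error_rate(errors, 0, 1)
print(marginals, both_01)
```

With `rho=0.6`, the estimated joint error rate of the first two models should sit noticeably above the independent-error product $0.2 \times 0.25 = 0.05$, while the marginals stay close to their targets.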

Abstract

Large language models (LLMs) are often ensembled to improve overall reliability and robustness, but in practice the models' errors are strongly correlated. This raises a fundamental question: which models should be selected when forming an LLM ensemble? We formulate budgeted ensemble selection as maximizing the mutual information between the true label and the predictions of the selected models. Furthermore, to explain why performance can saturate even with many models, we model the correlated errors of the models using a Gaussian copula and show an information-theoretic error floor for the performance of the ensemble. Motivated by these results, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data and iteratively builds an ensemble under a query budget. We test our approach on two question-answering datasets and one binary sentiment classification dataset: MEDMCQA, MMLU, and IMDB movie reviews. Across all datasets, we observe that our method consistently outperforms strong baselines under the same query budget.
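
The greedy selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a plug-in (count-based) estimate of $I(Y; X_S)$, and `empirical_mi`, `greedy_mi_select`, and the toy data are hypothetical. In the toy data, models 0 and 1 are 80% accurate but make identical errors, while model 2 is only 70% accurate and errs on different examples.

```python
from collections import Counter
from math import log2


def empirical_mi(labels, preds, subset):
    """Plug-in estimate of I(Y; X_S) in bits from paired samples."""
    n = len(labels)
    joint = Counter()
    for y, row in zip(labels, preds):
        joint[(y, tuple(row[j] for j in subset))] += 1
    count_y, count_x = Counter(), Counter()
    for (y, x), c in joint.items():
        count_y[y] += c
        count_x[x] += c
    return sum(
        (c / n) * log2(c * n / (count_y[y] * count_x[x]))
        for (y, x), c in joint.items()
    )


def greedy_mi_select(labels, preds, k):
    """Add, k times, the model that most increases the estimated I(Y; X_S)."""
    n_models = len(preds[0])
    selected = []
    for _ in range(k):
        remaining = [j for j in range(n_models) if j not in selected]
        best = max(
            remaining,
            key=lambda j: empirical_mi(labels, preds, selected + [j]),
        )
        selected.append(best)
    return selected


labels = [1, -1] * 5


def with_errors(wrong_idx):
    """Predictions that flip the true label on the given example indices."""
    return [-y if i in wrong_idx else y for i, y in enumerate(labels)]


m0 = with_errors({0, 1})     # 80% accurate
m1 = with_errors({0, 1})     # 80% accurate, same errors as m0
m2 = with_errors({2, 3, 4})  # 70% accurate, different errors
preds = list(zip(m0, m1, m2))

chosen = greedy_mi_select(labels, preds, k=2)
print(chosen)  # [0, 2]
```

Here the second pick is the weaker model 2 rather than the equally strong model 1, because a model with identical predictions adds no information about the label. Plug-in estimates become unreliable as the subset grows, since the number of joint outcomes grows exponentially with the ensemble size; a practical estimator would need to account for that.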

Paper Structure

This paper contains 42 sections, 10 theorems, 79 equations, 24 figures, 15 tables, 6 algorithms.

Key Result

Theorem 4.1

Consider an ensemble of independent binary symmetric channels (BSCs), where channel $j$ has error rate $\epsilon_j < 1/2$. Let the channels be indexed by accuracy so that $\epsilon_1 \le \epsilon_2 \le \ldots \le \epsilon_m$, and let $H_k = \{1, \dots, k\}$ be the set of the top-$k$ most accurate channels. Let $S^{MI}_k \in$ … Consequently, when the channels are independent, the error-optimal subset $S^\star$ and the mutual-…
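
The independent-error case can be checked numerically on a toy instance. The sketch below is an illustration rather than part of the paper: it assumes a uniform binary label (the prior is not stated in the excerpt above), and `bsc_ensemble_mi` and the error rates are hypothetical. It computes $I(Y; X_S)$ exactly by enumerating all channel outputs and brute-forces every size-$k$ subset.

```python
from itertools import combinations, product
from math import log2


def bsc_ensemble_mi(eps):
    """Exact I(Y; X_S) in bits for independent BSCs and a uniform binary Y."""
    mi = 0.0
    for x in product((1, -1), repeat=len(eps)):
        # Likelihood P(x | y) factorizes because the channels are independent.
        lik = {}
        for y in (1, -1):
            p = 1.0
            for xj, e in zip(x, eps):
                p *= (1.0 - e) if xj == y else e
            lik[y] = p
        p_x = 0.5 * (lik[1] + lik[-1])
        for y in (1, -1):
            if lik[y] > 0.0:
                mi += 0.5 * lik[y] * log2(lik[y] / p_x)
    return mi


error_rates = [0.10, 0.20, 0.25, 0.35, 0.45]  # already sorted by accuracy
k = 3
best_subset = max(
    combinations(range(len(error_rates)), k),
    key=lambda s: bsc_ensemble_mi([error_rates[j] for j in s]),
)
print(best_subset)  # (0, 1, 2)
```

On this instance the subset with the highest mutual information is the top-$k$ set by accuracy, which is the behavior expected when errors are independent. With correlated errors, the exact enumeration would need the joint error distribution instead of a product of per-channel terms.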

Figures (24)

  • Figure 1: Candidate LLM pool for ensembling: stars indicate each model’s accuracy, while chain links denote pairwise correlation (shared error patterns) with other models.
  • Figure 2: Example from MEDMCQA. (a) A diverse ensemble with lower average accuracy (72%) answers correctly by combining models from different families. (b) An ensemble of the strongest models, with higher average accuracy (81%), fails because almost all GPT models share the same error pattern on this example. Note that the original multiple-choice question is converted to binary format: for each candidate answer, we ask "Is the answer '[candidate]'?" with ground truth $Y=+1$ (correct) or $Y=-1$ (incorrect).
  • Figure 3: Our Greedy Mutual Information (MI)-based model selection framework.
  • Figure 4: Gaussian-copula validation on MEDMCQA for $\text{temp}=0.7$ (run 1). (a) Empirical versus copula-modeled pairwise joint error probabilities $P(E_i\cap E_j)$. (b) Comparison of higher-order simultaneous error distributions between real data and copula samples.
  • Figure 5: MEDMCQA test error vs. ensemble size (mean over temperatures, runs, and random splits). Shaded region represents the standard deviation.
  • ...and 19 more figures

Theorems & Definitions (29)

  • Theorem 4.1: Alignment of Error and Mutual Information
  • Remark 4.2
  • Theorem 4.3: Accuracy-Redundancy-Error Decomposition
  • Theorem 4.4: MAP Information Saturation
  • Definition 1.1: Binary Symmetric Channel
  • Definition 1.2: Stochastic Degradation
  • Lemma 1.3: BSC Degradation
  • proof
  • Definition 1.4: Top-$k$ Ensemble
  • proof
  • ...and 19 more