Beyond Mixtures and Products for Ensemble Aggregation: A Likelihood Perspective on Generalized Means

Raphaël Razafindralambo; Rémy Sun; Frédéric Precioso; Damien Garreau; Pierre-Alexandre Mattei

TL;DR

This work studies the normalized generalized mean of order $r$ through the lens of log-likelihood, the standard evaluation criterion in machine learning. It provides a unifying aggregation formalism and shows that different configurations are optimal in different situations.

Abstract

Density aggregation is a central problem in machine learning, for instance when combining predictions from a Deep Ensemble. The choice of aggregation remains an open question, with two commonly proposed approaches being linear pooling (probability averaging) and geometric pooling (logit averaging). In this work, we address this question by studying the normalized generalized mean of order $r \in \mathbb{R} \cup \{-\infty,+\infty\}$ through the lens of log-likelihood, the standard evaluation criterion in machine learning. This provides a unifying aggregation formalism and shows that different configurations are optimal in different situations. We show that the regime $r \in [0,1]$ is the only range ensuring systematic improvements relative to individual distributions, thereby providing a principled justification for the reliability and widespread practical use of linear ($r=1$) and geometric ($r=0$) pooling. In contrast, we show with explicit counterexamples that aggregation rules with $r \notin [0,1]$ may fail to provide consistent gains. Finally, we corroborate our theoretical findings with empirical evaluations using Deep Ensembles on image and text classification benchmarks.
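To make the aggregation rule concrete, here is a minimal sketch of the normalized generalized mean for categorical predictions, assuming uniform weights over ensemble members and strictly positive class probabilities. The function names (`power_mean`, `aggregate`, `nll`) and the two example distributions are illustrative assumptions, not taken from the paper.

```python
import math

def power_mean(values, r):
    """Power mean of order r of strictly positive values (uniform weights)."""
    k = len(values)
    if r == 0:                      # limit r -> 0: geometric mean
        return math.exp(sum(math.log(v) for v in values) / k)
    if r == math.inf:               # limit r -> +inf: maximum
        return max(values)
    if r == -math.inf:              # limit r -> -inf: minimum
        return min(values)
    return (sum(v ** r for v in values) / k) ** (1.0 / r)

def aggregate(dists, r):
    """Class-wise power mean of the members' probabilities, renormalized to sum to 1."""
    n_classes = len(dists[0])
    unnorm = [power_mean([d[c] for d in dists], r) for c in range(n_classes)]
    z = sum(unnorm)                 # normalizing constant
    return [u / z for u in unnorm]

def nll(dist, label):
    """Negative log-likelihood of a single label."""
    return -math.log(dist[label])

# Hypothetical predictions of two ensemble members over three classes.
p1 = [0.7, 0.2, 0.1]
p2 = [0.5, 0.3, 0.2]

for r in (-math.inf, -1, 0, 0.5, 1, 2, math.inf):
    agg = aggregate([p1, p2], r)
    print(r, [round(a, 4) for a in agg], round(nll(agg, 0), 4))
```

With $r=1$ this reduces to probability averaging (linear pooling) and with $r=0$ to the renormalized geometric mean (geometric pooling). For larger $|r|$ or very small probabilities, evaluating the mean in log-space would be numerically safer than the direct powers used here.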

Paper Structure (38 sections, 7 theorems, 67 equations, 7 figures)


Introduction
Background and notations on generalized means
Canonical examples: linear and logarithmic pooling
Generalized power mean
Generalized means of probability densities.
Related work
Likelihood guarantees for generalized mean aggregation
Does every order $r$ define a density?
When is the generalized mean reliably beneficial?
Failure cases outside the reliability range
The aggregation breaks down at disagreement points.
The aggregation breaks down at consensus points.
Experiments
How to build models to ensemble.
Datasets and architectures.
...and 23 more sections

Key Result

Lemma 2.1

Let $a_1,\dots,a_k$ be positive values and let $M_r(a_1,\dots,a_k)$ denote their power mean of order $r$. If $r<s$, then $M_r(a_1,\dots,a_k) \leq M_s(a_1,\dots,a_k)$, where equality holds if and only if $a_1 = a_2 = \dots = a_k$. This result generalizes the inequality chain $\frac{a_1+a_2}{2} \geq \sqrt{a_1 a_2} \geq \frac{2}{\frac{1}{a_1} + \frac{1}{a_2}}$ where the left inequality is known as AM-GM.
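As a quick sanity check of the two-value chain, take the illustrative values $a_1 = 1$ and $a_2 = 4$ (chosen here for convenience, not from the paper). The arithmetic, geometric, and harmonic means correspond to orders $r = 1$, $r = 0$, and $r = -1$:

$$\frac{1+4}{2} = 2.5 \;\geq\; \sqrt{1 \cdot 4} = 2 \;\geq\; \frac{2}{\frac{1}{1} + \frac{1}{4}} = 1.6 .$$

Both inequalities are strict because $a_1 \neq a_2$; setting $a_1 = a_2 = a$ makes all three means equal to $a$.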

Figures (7)

Figure 1: Only intermediate powers $r$, particularly $r \in [0,1]$, yield consistent NLL improvements, whereas extreme values can degrade performance. Top: aggregated densities $\overset{\circ}{p}_2$ obtained from two Gaussian experts $p^{(1)}$ and $p^{(2)}$ for various $r$. Small $r$ concentrates mass in a single region, while large positive $r$ preserves the bimodal structure of the experts. Bottom: NLL evaluated on samples $y \sim \mathcal{N}(0,4^2)$ (purple ticks indicate test samples). In this setting, negative $r$ (min-type behavior) underperforms the average individual NLL (dashed line), whereas positive $r$ achieves the lowest values. Our theory (Theorem 3.1) identifies $r \in [0,1]$ as the provably reliable interval. See the appendix (additional Gaussian examples) for different behaviors.
Figure 2: Visual illustration of Theorem 3.2. Aggregation fails at points where densities strongly differ when $r<0$, and at consensus points when $r>1$. We look at the location of the gap in \ref{eq: bad inequality likelihood} across $r$ regimes. "Avg log" denotes the average of individual log-likelihoods (right-hand side of \ref{eq: bad inequality likelihood}). Left (a): We consider two Gaussians $\mathcal{N}(\pm2.5,1)$. Center (b): For $r<0$, the inequality of Theorem 3.1 is satisfied at $x=2.5$. We observe that the gap is not confined and amplifies when $x$ increases. Right (c): For $r \ge 1$, the AM ($r=1$) globally dominates the average log-density, while all $r>1$ exhibit a very localized (but non-punctual) negative effect around $x=0$.
Figure 3: Cross-entropy of the normalized power mean likelihood $\overset{\circ}{p}_{k,r}$ in three classification settings. All plots exhibit a U-shaped performance curve: extreme aggregation amplifies model disagreement, whereas optimal values lie in $[0,1]$. Consistent with Theorem 3.1, the regime $r \in [0,1]$ remains reliably below the individual model uncertainty band. By contrast, negative orders ($r<0$) perform poorly, likely due to disagreements on dominant classes, as discussed in the section "Failure cases outside the reliability range".
Figure 4: Illustration of the cross-entropy behavior of the normalized power mean likelihood $\overset{\circ}{p}_{k,r}$, focusing on values of $r$ around $[0,1]$ (zoomed view of Figure 3). On MedMNIST (b) and IMDb (c), the optimal likelihood lies within the reliability interval $[0,1]$ (Theorem 3.1), whereas on CIFAR-100 (a) it lies slightly beyond it ($\approx 1.4$). This shows that while the interval $[0,1]$ provides a stable and reliable regime, mild optimism outside it can still be beneficial in practice.
Figure 5: Illustration of the cross-entropy behavior of the normalized power mean likelihood $\overset{\circ}{p}_{k,r}$ on CIFAR-100 in a controlled near-consensus regime. The trend contrasts with Figure 3: extreme optimistic aggregation becomes harmful and the maximum operator performs worst.
...and 2 more figures
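The failure modes described in the Figure 2 caption can be probed numerically. The sketch below uses the two Gaussians $\mathcal{N}(\pm 2.5, 1)$ from that caption and computes the pointwise gap between the log of the normalized power mean and the average of the individual log-densities. Everything else is an assumption made for illustration: the function names, the integration interval $[-10, 10]$, the midpoint rule with 4000 cells, and the probe points. It is not the authors' code.

```python
import math

def log_gauss(x, mu, sigma=1.0):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_power_mean2(la, lb, r):
    """Log of the order-r power mean of two positive values, given their logs."""
    if r == 0:                                  # geometric mean
        return 0.5 * (la + lb)
    u, v = r * la, r * lb
    m = max(u, v)                               # log-sum-exp trick for stability
    return (m + math.log(math.exp(u - m) + math.exp(v - m)) - math.log(2)) / r

def normalizer(r, lo=-10.0, hi=10.0, n=4000):
    """Midpoint-rule estimate of the integral of the unnormalized power mean."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += math.exp(log_power_mean2(log_gauss(x, -2.5), log_gauss(x, 2.5), r))
    return total * h

def log_gap(x, r):
    """log(normalized power mean at x) minus the average individual log-density."""
    la, lb = log_gauss(x, -2.5), log_gauss(x, 2.5)
    return log_power_mean2(la, lb, r) - math.log(normalizer(r)) - 0.5 * (la + lb)

# Consensus point x = 0 with r > 1, and a strong-disagreement point with r < 0.
print("r=2,   x=0:", log_gap(0.0, 2.0))    # negative: aggregate is worse
print("r=-1,  x=4:", log_gap(4.0, -1.0))   # negative: aggregate is worse
print("r=0.5, x=0:", log_gap(0.0, 0.5))    # positive: aggregate is better
```

A negative gap means the aggregate assigns lower log-density at that point than the average expert does. In this sketch that happens at the consensus point $x=0$ for $r=2$ and far into one expert's tail for $r=-1$, while $r=0.5$ keeps the gap positive at the probed points.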

Theorems & Definitions (12)

Lemma 2.1: Power mean inequality (Hardy, Littlewood, and Pólya, 1952, Chapter 2)
Definition 2.1: Generalized power mean of densities
Proposition 3.1: Generalized means of order $r$ are well defined
proof
Theorem 3.1: Wisdom of Crowds on Log-likelihood
proof
Theorem 3.2: Failure of the Wisdom of Crowds
proof
Lemma D.1: Multinomial theorem
Proposition D.1
...and 2 more
