A Unified Theory of Diversity in Ensemble Learning

Danny Wood, Tingting Mu, Andrew Webb, Henry Reeve, Mikel Luján, Gavin Brown

TL;DR

This work presents a unified theory of ensemble diversity, showing that diversity is a hidden dimension of the bias-variance decomposition across a broad family of losses. It derives exact bias-variance-diversity decompositions for squared loss, cross-entropy, and Poisson losses, each paired with a centroid combiner determined by the loss (e.g., the arithmetic mean for squared loss and the normalised geometric mean for KL-divergence). The framework extends to the wider class of Bregman divergences, with analytical forms for the centroids that expose loss-specific geometry and diversity properties. For losses with no additive bias-variance decomposition, such as 0/1 loss, it instead quantifies the effects of diversity, which depend on the label distribution. The paper also offers practical methods for estimating bias, variance, and diversity from data, and argues that diversity is a component of model fit rather than a quantity to be maximised, giving a three-way bias-variance-diversity trade-off with implications for ensemble design and regularisation. Overall, the theory places differing notions of ensemble diversity under loss-driven bias-variance decompositions, guiding principled ensemble construction and interpretation.
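
To make the loss-dependent combiner concrete, here is a minimal Python sketch for the KL-divergence case. It builds the normalised geometric mean of a few member distributions and checks numerically that the ensemble loss equals the average member loss minus an ambiguity term of the form $\frac{1}{m}\sum_i \mathrm{KL}(\bar{q}\,\|\,q_i)$. The target `y`, the member predictions, and the helper names are hypothetical values chosen for illustration, not taken from the paper, and all distributions are assumed strictly positive.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def normalised_geometric_mean(members):
    """Per-class geometric mean of the member predictions, renormalised to sum to 1."""
    m = len(members)
    unnorm = [
        math.exp(sum(math.log(q[k]) for q in members) / m)
        for k in range(len(members[0]))
    ]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical target distribution and three member predictions (illustrative only).
y = [0.7, 0.2, 0.1]
members = [
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
    [0.8, 0.1, 0.1],
]

q_bar = normalised_geometric_mean(members)

ensemble_loss = kl(y, q_bar)
avg_member_loss = sum(kl(y, q) for q in members) / len(members)
# Ambiguity: average divergence from the combined prediction to each member.
ambiguity = sum(kl(q_bar, q) for q in members) / len(members)

print(f"ensemble loss        : {ensemble_loss:.6f}")
print(f"avg loss - ambiguity : {avg_member_loss - ambiguity:.6f}")
```

The two printed values should agree up to floating-point error. Swapping in the arithmetic mean as the combiner would generally break the equality for this loss, which is the sense in which the combiner depends on the loss.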

Abstract

We present a theory of ensemble diversity, explaining the nature of diversity for a wide range of supervised learning scenarios. This challenge has been referred to as the holy grail of ensemble learning, an open research issue for over 30 years. Our framework reveals that diversity is in fact a hidden dimension in the bias-variance decomposition of the ensemble loss. We prove a family of exact bias-variance-diversity decompositions, for a wide range of losses in both regression and classification, e.g., squared, cross-entropy, and Poisson losses. For losses where an additive bias-variance decomposition is not available (e.g., 0/1 loss) we present an alternative approach: quantifying the effects of diversity, which turn out to be dependent on the label distribution. Overall, we argue that diversity is a measure of model fit, in precisely the same sense as bias and variance, but accounting for statistical dependencies between ensemble members. Thus, we should not be maximising diversity as so many works aim to do -- instead, we have a bias/variance/diversity trade-off to manage.
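
As a rough illustration of the squared-loss case, the sketch below simulates repeated training runs at a single test point and checks that the expected ensemble loss equals average (squared) bias plus average variance minus diversity. Everything here is an assumption made for the example: the label is fixed and noise-free, the members are synthetic Gaussian predictors with a shared noise term to make them dependent, and the variable names are not the paper's notation.

```python
import random

random.seed(0)

y = 1.0                      # fixed, noise-free label at one test point (illustrative)
m, trials = 4, 2000          # ensemble size and number of simulated training sets
offsets = [0.3, -0.1, 0.2, 0.0]  # hypothetical per-member systematic errors

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

# Each trial stands in for one draw of the training data. The shared term makes
# members statistically dependent; the second term is member-specific noise.
preds = []
for _ in range(trials):
    shared = random.gauss(0.0, 0.5)
    preds.append([y + offsets[i] + shared + random.gauss(0.0, 0.5) for i in range(m)])

ensemble = [mean(p) for p in preds]  # arithmetic-mean combiner per trial
expected_loss = mean((qb - y) ** 2 for qb in ensemble)

member_means = [mean(p[i] for p in preds) for i in range(m)]
avg_bias = mean((member_means[i] - y) ** 2 for i in range(m))
avg_variance = mean(
    mean((p[i] - member_means[i]) ** 2 for p in preds) for i in range(m)
)
# Diversity: expected spread of the members around the ensemble prediction.
diversity = mean(mean((q - qb) ** 2 for q in p) for p, qb in zip(preds, ensemble))

print(f"expected ensemble loss        : {expected_loss:.6f}")
print(f"bias + variance - diversity   : {avg_bias + avg_variance - diversity:.6f}")
```

Because every expectation is replaced by the same empirical average over trials, the two quantities match to floating-point precision. Diversity enters with a negative sign, but raising it typically also changes the bias and variance terms, which is why it is a trade-off rather than something to maximise in isolation.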

Paper Structure

This paper contains 52 sections, 23 theorems, 80 equations, 17 figures, 6 tables, and 1 algorithm.

Key Result

Theorem 1

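Given a label $y\in\mathbb{R}$, a member prediction $q_i({\bf x})$, and an ensemble $\bar{q}({\bf x})= \frac{1}{m}\sum_{i=1}^m q_i({\bf x})$, we have

$$\big(\bar{q}({\bf x})-y\big)^2 = \frac{1}{m}\sum_{i=1}^m \big(q_i({\bf x})-y\big)^2 - \frac{1}{m}\sum_{i=1}^m \big(q_i({\bf x})-\bar{q}({\bf x})\big)^2.$$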
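
The ambiguity decomposition of Krogh & Vedelsby is usually stated as: the squared error of the averaged ensemble equals the average squared error of the members minus the average squared deviation of the members from the ensemble. Assuming that standard form, a quick numerical check with made-up member predictions looks like this:

```python
# Hypothetical label and member predictions at a single input x (illustrative only).
y = 2.0
q = [1.5, 2.4, 3.1, 1.8]
m = len(q)

q_bar = sum(q) / m                                   # arithmetic-mean ensemble
ensemble_error = (q_bar - y) ** 2
avg_member_error = sum((qi - y) ** 2 for qi in q) / m
ambiguity = sum((qi - q_bar) ** 2 for qi in q) / m   # spread of members around q_bar

print(f"ensemble error             : {ensemble_error:.6f}")
print(f"avg member error - ambiguity: {avg_member_error - ambiguity:.6f}")
```

Since the ambiguity term is non-negative, the ensemble's squared error can never exceed the average squared error of its members.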

Figures (17)

  • Figure 1: Parallel vs sequential ensemble construction. Both can be seen as creating "diverse" models in some sense: either implicitly (independently re-sampling the training data) or explicitly (re-sampling according to the errors of earlier models).
  • Figure 2: Accuracy/diversity for two (hypothetical) diversity measures. Measure B (right) is more desirable, as it correlates more strongly with performance improvement.
  • Figure 3: The classic dartboard analogy for explaining bias and variance.
  • Figure 4: Dartboard diagram illustrating bias/variance for the KL-divergence.
  • Figure 5: Bagging depth-8 trees with increasing ensemble size (California Housing data).
  • ...and 12 more figures

Theorems & Definitions (27)

  • Theorem 1: Ambiguity decomposition, Krogh & Vedelsby, 1994
  • Definition 2: Generalised Bias-Variance Decomposition
  • Proposition 3: Generalised Ambiguity Decomposition
  • Definition 4: Centroid Combiner rule
  • Theorem 5: Generalized Bias-Variance-Diversity decomposition
  • Theorem 6: Bias-Variance Effect decomposition, James & Hastie 1997
  • Proposition 6: Ambiguity-Effect Decomposition
  • Theorem 7: Bias-Variance-Diversity effect decomposition
  • Corollary 7
  • Theorem 8: Non-existence of label-independent diversity-effect for 0/1 loss
  • ...and 17 more