Bias-variance decompositions: the exclusive privilege of Bregman divergences

Tom Heskes

TL;DR

This paper asks which loss functions admit a clean bias-variance decomposition. It shows that, under mild regularity conditions and the identity of indiscernibles, only $g$-Bregman divergences admit such a decomposition, and that the squared Mahalanobis distance is, up to an invertible change of variables, the only symmetric loss that does. It also clarifies the role of equality constraints, equality-relaxations, and connections to exponential-family KL divergences, and implies that classical losses such as 0-1 or $L_1$ do not support a clean bias-variance split. The findings delineate the boundaries of decomposability in loss design, with implications for understanding generalization and the applicability of bias-variance analyses in non-quadratic settings.

Abstract

Bias-variance decompositions are widely used to understand the generalization performance of machine learning models. While the squared error loss permits a straightforward decomposition, other loss functions - such as zero-one loss or $L_1$ loss - either fail to sum bias and variance to the expected loss or rely on definitions that lack the essential properties of meaningful bias and variance. Recent research has shown that clean decompositions can be achieved for the broader class of Bregman divergences, with the cross-entropy loss as a special case. However, the necessary and sufficient conditions for these decompositions remain an open question. In this paper, we address this question by studying continuous, nonnegative loss functions that satisfy the identity of indiscernibles (zero loss if and only if the two arguments are identical), under mild regularity conditions. We prove that so-called $g$-Bregman or rho-tau divergences are the only such loss functions that have a clean bias-variance decomposition. A $g$-Bregman divergence can be transformed into a standard Bregman divergence through an invertible change of variables. This makes the squared Mahalanobis distance, up to such a variable transformation, the only symmetric loss function with a clean bias-variance decomposition. Consequently, common metrics such as $0$-$1$ and $L_1$ losses cannot admit a clean bias-variance decomposition, explaining why previous attempts have failed. We also examine the impact of relaxing the restrictions on the loss functions and how this affects our results.
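As a rough numerical illustration of the claim that Bregman divergences decompose cleanly, the sketch below checks that expected loss equals bias plus variance for the generalized KL divergence over a finite sample of predictions. It is not taken from the paper: the function names, the sample, and the choice of generalized KL are assumptions made here, and the central prediction is taken to be the coordinate-wise geometric mean, which is the dual mean for this particular divergence.

```python
import math
import random

def gen_kl(p, q):
    """Generalized KL divergence between positive vectors.

    This is the Bregman divergence generated by F(x) = sum(x*log(x) - x).
    """
    return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))

def central_prediction(preds):
    """Dual mean for gen_kl: coordinate-wise geometric mean of the predictions."""
    n = len(preds)
    dim = len(preds[0])
    return [math.exp(sum(math.log(y[k]) for y in preds) / n) for k in range(dim)]

def decompose(label, preds, loss, center):
    """Return (expected loss, bias, variance) for a fixed label (illustrative helper)."""
    expected_loss = sum(loss(label, y) for y in preds) / len(preds)
    bias = loss(label, center)                                  # label vs. central prediction
    variance = sum(loss(center, y) for y in preds) / len(preds) # central prediction vs. predictions
    return expected_loss, bias, variance

# Hypothetical data: one fixed label and 200 random positive predictions.
rng = random.Random(0)
label = [0.2, 0.5, 0.3]
preds = [[rng.uniform(0.05, 1.0) for _ in range(3)] for _ in range(200)]

y_star = central_prediction(preds)
expected_loss, bias, variance = decompose(label, preds, gen_kl, y_star)
gap = expected_loss - (bias + variance)
print(f"expected loss = {expected_loss:.6f}")
print(f"bias + variance = {bias + variance:.6f}")
print(f"gap = {gap:.2e}")
```

The gap should vanish up to floating-point error. Replacing the geometric mean with the arithmetic mean in `central_prediction` generally leaves a nonzero gap, which shows that the decomposition depends on using the central prediction matched to the divergence.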

Paper Structure

This paper contains 7 sections, 21 theorems, 114 equations, 1 figure.

Key Result

Proposition 2

The specific ordering of $\mathbf{T}$, $\mathbf{t}^*$, $\mathbf{y}^*$, and $\mathbf{Y}$ on the right-hand side of (eq_clean) is the only ordering that can lead to a clean bias-variance decomposition for asymmetric loss functions $\ell$ with the identity of indiscernibles. For asymmetric loss functio

Figures (1)

  • Figure 1: Sketch of the construction used in the proof of Lemma (th_factorization). We start from a distribution with just two predictions $\mathbf{y}_1$ and $\mathbf{y}_2$, each with probability $1/2$, leading to the central prediction $\mathbf{y}^*$. Each line segment symbolically represents a loss, for example $v_1 = \ell(\mathbf{y}^*,\mathbf{y}_1)$, with the arrow pointing from the second to the first argument. Specializing the bias-variance decomposition (eq_singlelabel) to this two-prediction case gives $(l_1 + l_2)/2 = b + (v_1 + v_2)/2$. The parallelogram indicated by the solid lines is essentially the same construction as in Figure 10 of nielsen2021parallel. Next, we consider a slight, infinitesimal change in the predictions, yielding $\mathbf{y}_1'$ and $\mathbf{y}_2'$, that keeps the central prediction $\mathbf{y}^*$ and hence the bias $b$ invariant. This invariance implies $(l_1' + l_2') - (v_1' + v_2') = 2 b = (l_1 + l_2) - (v_1 + v_2)$. Requiring this to hold for any such change that keeps the central prediction intact, even when $\mathbf{y}^*$ is not close to the label $\mathbf{t}$, places a strong constraint on the mixed second derivative of the loss function.
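
The two-prediction identity in the caption can be checked numerically. The sketch below is only an illustration, not the paper's construction: it assumes the squared Euclidean distance as the loss (so the central prediction is the arithmetic mean), and the points, the perturbation `delta`, and the helper names are all made up for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance, used here as an example Bregman divergence."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def two_point_terms(t, y1, y2):
    """Losses for two equally likely predictions y1, y2 and label t.

    Assumes squared Euclidean loss, so the central prediction is the midpoint.
    """
    y_star = [(a + b) / 2 for a, b in zip(y1, y2)]
    l1, l2 = sq_dist(t, y1), sq_dist(t, y2)            # label vs. each prediction
    v1, v2 = sq_dist(y_star, y1), sq_dist(y_star, y2)  # central prediction vs. each prediction
    b = sq_dist(t, y_star)                             # bias: label vs. central prediction
    return l1, l2, v1, v2, b

# Hypothetical label and predictions in the plane.
t = [3.0, -1.0]
y1, y2 = [0.0, 1.0], [2.0, 0.0]
l1, l2, v1, v2, b = two_point_terms(t, y1, y2)
lhs = (l1 + l2) / 2
rhs = b + (v1 + v2) / 2

# Shift the predictions in opposite directions so the midpoint stays fixed.
delta = [0.01, -0.02]
y1p = [a + d for a, d in zip(y1, delta)]
y2p = [a - d for a, d in zip(y2, delta)]
l1p, l2p, v1p, v2p, bp = two_point_terms(t, y1p, y2p)
invariant_before = (l1 + l2) - (v1 + v2)
invariant_after = (l1p + l2p) - (v1p + v2p)

print(lhs, rhs)
print(invariant_before, invariant_after, 2 * b)
```

For this loss, both quantities equal $2b$ before and after the perturbation, up to floating-point error. For other Bregman divergences the perturbation that keeps $\mathbf{y}^*$ fixed is in general not a symmetric shift.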

Theorems & Definitions (23)

  • Definition 1
  • Proposition 2
  • Proposition 3
  • Definition 4
  • Proposition 5
  • Lemma 6
  • Lemma 7
  • Theorem 8
  • Theorem 9
  • Corollary 10
  • ...and 13 more