A Bias-Variance-Covariance Decomposition of Kernel Scores for Generative Models

Sebastian G. Gruber; Florian Buettner

TL;DR

This work introduces a bias-variance-covariance decomposition for kernel scores to analyze generalization and uncertainty in generative models. It defines distributional variance $\operatorname{Var}_k(P)$ and distributional covariance $\operatorname{Cov}_k(P,Q)$ and shows how the expected kernel score decomposes as $\mathbb{E}[S_k(\hat{P},Y)] = -\lVert Q\rVert_k^2 + \lVert \mathbb{E}[\hat{P}] - Q \rVert_k^2 + \operatorname{Var}_k(\hat{P})$, with an ensemble adding a covariance term. The authors provide unbiased, consistent estimators $\widehat{\operatorname{Var}}_k^{(n,m)}$ and $\widehat{\operatorname{Cov}}_k^{(n,m)}$ that rely only on samples, enabling BVCD analysis for both open- and closed-source models. Empirically, kernel entropy demonstrates strong predictive power for generalization in image and audio tasks and outperforms baselines in uncertainty estimation for NLP question answering on CoQA and TriviaQA. The framework offers a transferable, kernel-based approach to quantify uncertainty in diverse generative settings and provides practical guidance on kernel choice and sample requirements.

Abstract

Generative models, like large language models, are becoming increasingly relevant in our daily lives, yet a theoretical framework to assess their generalization behavior and uncertainty does not exist. Particularly, the problem of uncertainty estimation is commonly solved in an ad-hoc and task-dependent manner. For example, natural language approaches cannot be transferred to image generation. In this paper, we introduce the first bias-variance-covariance decomposition for kernel scores. This decomposition represents a theoretical framework from which we derive a kernel-based variance and entropy for uncertainty estimation. We propose unbiased and consistent estimators for each quantity which only require generated samples but not the underlying model itself. Based on the wide applicability of kernels, we demonstrate our framework via generalization and uncertainty experiments for image, audio, and language generation. Specifically, kernel entropy for uncertainty estimation is more predictive of performance on CoQA and TriviaQA question answering datasets than existing baselines and can also be applied to closed-source models.
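As a rough illustration of the sample-only idea, the sketch below estimates a predictive kernel entropy as the negative average kernel value between generated outputs, following the description in the Figure 1 caption. The RBF kernel, the `gamma` parameter, the function names, and the use of fixed-size embedding vectors as inputs are assumptions made for this example, not details taken from the paper. Self-pairs are excluded here so that the average targets $\mathbb{E}[k(X, X')]$ for two independent samples; that is also a choice of this sketch.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    """Gram matrix with entries exp(-gamma * ||a_i - b_j||^2) (assumed kernel)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kernel_entropy(samples, gamma=1.0):
    """Negative average kernel value over all pairs of distinct samples.

    samples: array of shape (n, d), e.g. embeddings of n generated outputs
    for one input (hypothetical setup). Requires n >= 2.
    """
    n = samples.shape[0]
    gram = rbf_kernel(samples, samples, gamma)
    # Drop the diagonal (self-similarities) before averaging.
    off_diag_sum = gram.sum() - np.trace(gram)
    return -off_diag_sum / (n * (n - 1))


# Usage: tightly clustered outputs give a lower entropy than spread-out ones.
rng = np.random.default_rng(0)
points = rng.normal(size=(8, 2))
low = kernel_entropy(0.01 * points)
high = kernel_entropy(10.0 * points)
```

Only generated samples enter the computation, so the same function applies whether or not the model's weights or likelihoods are accessible.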

Paper Structure (39 sections, 1 theorem, 49 equations, 21 figures, 1 table, 1 algorithm)

Introduction
Background
Kernel Scores
Bias-Variance (-Covariance) Decompositions
Uncertainty in Natural Language Generation
A Bias-Variance-Covariance Decomposition of Kernel Scores
Predictive Kernel Entropy for Single Models
Unbiased and Consistent Estimators
Distributional Variance
Distributional Covariance and Correlation
Applications
Image Generation
Audio Generation
Natural Language Generation
Limitations
...and 24 more sections

Key Result

Theorem 3.2

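Let $S_k$ be a kernel score based on a p.s.d. kernel $k$ and $\hat{P}$ a predicted distribution for a target $Y \sim Q$. Then $\mathbb{E}[S_k(\hat{P},Y)] = -\lVert Q\rVert_k^2 + \lVert \mathbb{E}[\hat{P}] - Q \rVert_k^2 + \operatorname{Var}_k(\hat{P})$. For an ensemble prediction $\hat{P}^{\left( n \right)} \coloneqq \frac{1}{n} \sum_{i=1}^n \hat{P}_i$ with identically distributed members $\hat{P}_1, \dots, \hat{P}_n$, the decomposition additionally contains a covariance term between the members.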
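The role of the covariance term for ensembles can be made explicit with a short calculation. The following is a sketch under stated assumptions, not the theorem's wording: it assumes $\operatorname{Var}_k(\hat{P}) = \mathbb{E}\lVert \hat{P} - \mathbb{E}[\hat{P}] \rVert_k^2$ and $\operatorname{Cov}_k(\hat{P}, \hat{P}') = \mathbb{E}\langle \hat{P} - \mathbb{E}[\hat{P}],\, \hat{P}' - \mathbb{E}[\hat{P}'] \rangle_k$, which is the natural reading of the names but is not reproduced in this summary. Under these assumptions, bilinearity of the inner product gives:

```latex
\begin{align*}
\operatorname{Var}_k\bigl(\hat{P}^{(n)}\bigr)
  &= \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n
     \operatorname{Cov}_k\bigl(\hat{P}_i, \hat{P}_j\bigr) \\
  &= \frac{1}{n} \operatorname{Var}_k\bigl(\hat{P}_1\bigr)
   + \frac{1}{n^2} \sum_{i \neq j}
     \operatorname{Cov}_k\bigl(\hat{P}_i, \hat{P}_j\bigr).
\end{align*}
```

Because the members are identically distributed, $\mathbb{E}[\hat{P}^{(n)}] = \mathbb{E}[\hat{P}_1]$, so the noise and bias terms are unchanged. Only the variance term shrinks with $n$, while the pairwise covariances limit how much an ensemble can gain.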

Figures (21)

Figure 1: Top: Illustration of predictive kernel entropy for a generative model. A kernel measures the pairwise similarities (red lines) of outputs in a vector space. The predictive kernel entropy is then the negative average kernel value. Bottom: The predictive kernel entropy shows the best performance among uncertainty approaches for single-model settings (cf. the Natural Language Generation section).
Figure 2: Left: Illustration of the estimator $\widehat{\operatorname{Var}}_k^{\left(n,m\right)}$ in the sample space $\mathcal{X}$ for $n=2$ outer samples and $m=3$ inner samples. The estimator computes the average similarity within clusters (solid red lines) minus the average similarity between clusters (dotted blue lines). Shorter lines indicate higher similarity and larger kernel values. Right: Estimator standard deviation for various sample sizes. Even though the estimator does not converge in theory with the inner sample size $m$, it may still be influenced significantly by it for small sample sizes.
Figure 3: Left: The variance starts high and decreases throughout training. From 20 epochs onwards, the variance stays stable for all classes and no overfitting can be observed. Mid: The bias decreases much faster than the variance, reaching its minimum at 5 epochs, and converges after 10 epochs. Right: Like the variance, the distributional correlation between training epochs shows that convergence happens around epoch 20. Remarkably, the 'square' of very high correlations indicates that the model is stable in its convergence and does not iterate through equally good solutions.
Figure 4: Left: Dependence between squared MMD and distributional variance. The distributional variance correlates linearly with the squared MMD. Right: Pearson correlation between squared MMD and distributional variance is very high ($\approx 0.95$) throughout training. Approximation via deep ensembles does not deteriorate this relation. Consequently, distributional variance and kernel entropy represent viable measures of uncertainty.
Figure 5: MMD$^2$, variance, and bias for class '0' throughout training with a reduced training set of '0's. After 5 epochs, mode collapse occurs, which is only expressed in the increased bias. This indicates that mode collapse is a contrary phenomenon to overfitting.
...and 16 more figures
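The Figure 2 caption describes the variance estimator as the average similarity within clusters minus the average similarity between clusters, with $n$ outer samples (clusters) and $m$ inner samples per cluster. A minimal sketch of that description follows. The RBF kernel, `gamma`, the function names, the `(n, m, d)` array layout, and the exclusion of self-pairs in the within-cluster average are assumptions of this example; the paper's exact estimator may differ in its details.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    """Gram matrix with entries exp(-gamma * ||a_i - b_j||^2) (assumed kernel)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def distributional_variance(clusters, gamma=1.0):
    """Within-cluster minus between-cluster average kernel value.

    clusters: array of shape (n, m, d) holding m samples from each of
    n predicted distributions (hypothetical layout). Requires n, m >= 2.
    """
    n, m, d = clusters.shape
    flat = clusters.reshape(n * m, d)
    gram = rbf_kernel(flat, flat, gamma)

    # block_sums[i, j] = sum of kernel values between cluster i and cluster j.
    block_sums = gram.reshape(n, m, n, m).sum(axis=(1, 3))
    within_total = np.trace(block_sums)

    # Average over distinct pairs inside each cluster (self-pairs removed).
    within = (within_total - np.trace(gram)) / (n * m * (m - 1))
    # Average over all pairs drawn from two different clusters.
    between = (block_sums.sum() - within_total) / (n * (n - 1) * m * m)
    return within - between


# Usage: n=3 tight clusters placed far apart versus identical clusters.
centers = 100.0 * np.eye(3)
separated = np.repeat(centers[:, None, :], 2, axis=1)  # shape (3, 2, 3)
identical = np.zeros((2, 3, 2))                        # shape (2, 3, 2)
var_separated = distributional_variance(separated)
var_identical = distributional_variance(identical)
```

When all clusters coincide, the within and between averages cancel and the estimate is zero; when clusters are internally tight but far from each other, the between term vanishes and the estimate approaches the kernel's maximum value.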

Theorems & Definitions (4)

Definition 3.1
Theorem 3.2
Example 3.3
Example 3.4
