The Structure of Cross-Validation Error: Stability, Covariance, and Minimax Limits

Ido Nachum, Rüdiger Urbanke, Thomas Weinberger

TL;DR

The paper studies how to choose the number of folds in $k$-fold cross-validation (CV). It introduces a decomposition of the mean-squared error that separates Squared Loss Stability (SLS) from inter-fold covariance. For ERM algorithms it proves a minimax lower bound $\mathfrak{R}_{CV}(\mathcal{A}) = \Omega(\sqrt{k}/n)$, and it exhibits learning rules whose CV error is $\Omega(k/n)$, which matches (up to constants) the accuracy of a hold-out estimate on a single fold of size $n/k$. Most of the results rest on the SLS-covariance decomposition, which clarifies when CV improves on hold-out and how dependence between folds impedes risk estimation. The paper also gives tight analyses for the linear function class and the majority algorithm, with concrete asymptotics that inform the choice of $k$. Overall, it establishes fundamental limits of resampling-based risk estimation and proposes the Majority and square-wave constructions as baselines for future theory and practice.
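To see why inter-fold covariance matters at all, it may help to recall the textbook variance identity for an average of $k$ fold errors. This is a generic fact, not the paper's Lemma 4.1, and the notation is an assumption made here: $\widehat{L}_i$ denotes the error measured on fold $i$ by the model trained on the other $k-1$ folds, and $\widehat{L}_{\mathrm{CV}}^{(k)} = \frac{1}{k}\sum_{i=1}^{k}\widehat{L}_i$.

\[ \operatorname{Var}\!\big(\widehat{L}_{\mathrm{CV}}^{(k)}\big) \;=\; \frac{1}{k^{2}}\sum_{i=1}^{k}\operatorname{Var}\!\big(\widehat{L}_i\big) \;+\; \frac{1}{k^{2}}\sum_{i \neq j}\operatorname{Cov}\!\big(\widehat{L}_i,\widehat{L}_j\big) \]

If the fold errors were uncorrelated, the second sum would vanish and averaging would cut the variance by a factor of $k$. Because the training sets of different folds overlap, the covariance terms are generally nonzero.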

Abstract

Despite ongoing theoretical research on cross-validation (CV), many theoretical questions about CV remain wide open. This motivates our investigation into how properties of algorithm-distribution pairs can affect the choice of the number of folds in $k$-fold cross-validation. Our results consist of a novel decomposition of the mean-squared error of cross-validation for risk estimation, which explicitly captures the correlations of error estimates across overlapping folds and includes a novel algorithmic stability notion, squared loss stability, that is considerably weaker than the hypothesis stability typically required in comparable works. Furthermore, we prove:

1. For every learning algorithm that minimizes empirical error, a minimax lower bound on the mean-squared error of $k$-fold CV estimating the population risk $L_\mathcal{D}$: \[ \min_{k \mid n}\; \max_{\mathcal{D}}\; \mathbb{E}\!\left[\big(\widehat{L}_{\mathrm{CV}}^{(k)} - L_{\mathcal{D}}\big)^{2}\right] \;=\; \Omega\!\big(\sqrt{k}/n\big), \] where $n$ is the sample size and $k$ the number of folds. This shows that even under idealized conditions, for large values of $k$, CV cannot attain the optimum of order $1/n$ achievable by a validation set of size $n$, reflecting an inherent penalty caused by dependence between folds.

2. Complementing this, we exhibit learning rules for which \[ \max_{\mathcal{D}}\; \mathbb{E}\!\left[\big(\widehat{L}_{\mathrm{CV}}^{(k)} - L_{\mathcal{D}}\big)^{2}\right] \;=\; \Omega(k/n), \] matching (up to constants) the accuracy of a hold-out estimator of a single fold of size $n/k$.

Together these results delineate the fundamental trade-off in resampling-based risk estimation: CV cannot fully exploit all $n$ samples for unbiased risk evaluation, and its minimax performance is pinned between the $k/n$ and $\sqrt{k}/n$ regimes.
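The quantity bounded in the abstract, $\mathbb{E}[(\widehat{L}_{\mathrm{CV}}^{(k)} - L_{\mathcal{D}})^2]$, can be estimated by simulation for a toy learner. The sketch below is a minimal illustration under several assumptions that are not taken from the paper: labels are i.i.d. Bernoulli($p$), the learner predicts the majority training label (ties go to 1), the loss is 0-1, and $L_{\mathcal{D}}$ is taken to be the true risk of the predictor trained on all $n$ samples. The function names are hypothetical, and the paper's own Majority construction may be defined differently.

```python
import random


def majority_predict(train_labels):
    """Return the majority label in {0, 1}; ties go to 1 (arbitrary choice)."""
    return int(2 * sum(train_labels) >= len(train_labels))


def kfold_cv_estimate(labels, k):
    """k-fold CV estimate of the 0-1 risk of the majority learner."""
    n = len(labels)
    if n % k != 0:
        raise ValueError("k must divide n")
    m = n // k  # fold size
    fold_errors = []
    for i in range(k):
        held_out = labels[i * m:(i + 1) * m]
        train = labels[:i * m] + labels[(i + 1) * m:]
        pred = majority_predict(train)
        fold_errors.append(sum(y != pred for y in held_out) / m)
    return sum(fold_errors) / k


def cv_mse(n, k, p, trials=500, seed=0):
    """Monte Carlo estimate of E[(CV estimate - true risk)^2]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        labels = [int(rng.random() < p) for _ in range(n)]
        # Assumption: the target is the true risk of the constant
        # predictor trained on the full sample.
        pred_full = majority_predict(labels)
        true_risk = (1 - p) if pred_full == 1 else p
        total += (kfold_cv_estimate(labels, k) - true_risk) ** 2
    return total / trials


if __name__ == "__main__":
    for k in (2, 5, 10):
        print(k, cv_mse(n=100, k=k, p=0.5))
```

Varying `k` at fixed `n` gives a rough empirical picture of how the squared error depends on the number of folds for this one learner and distribution. It says nothing about the worst case over distributions that the minimax statements concern.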

Paper Structure

This paper contains 47 sections, 48 theorems, 327 equations.

Key Result

Lemma 2.4

Assume that the loss functional is bounded between $0$ and $1$, that the risk has means $\mathbb{E}[L]=\bar{L}$ and $\mathbb{E}[L^{(k)}]=\bar{L}^{(k)}$, and denote the variances of the loss by $\sigma_n^2:=\operatorname{Var}(L)$ and $\sigma_{n-m}^2:=\operatorname{Var}(L_1^{(k)})$. Then, the squared loss stability ...

Theorems & Definitions (99)

  • Definition 2.1: Hypothesis Stability
  • Definition 2.2: Loss Stability
  • Definition 2.3: Squared Loss Stability
  • Lemma 2.4: Bounds on the Squared Loss Stability
  • Lemma 4.1: Decomposition of the MSE
  • proof
  • Lemma 4.2: Expected Risk Variance for Bounded Loss
  • proof
  • Theorem 4.3: Characterization of the MSE
  • proof
  • ...and 89 more