Cross-validation: what does it estimate and how well does it do it?

Stephen Bates, Trevor Hastie, Robert Tibshirani

TL;DR

For the linear model fit by ordinary least squares, the authors prove that cross-validation does not estimate the prediction error of the model at hand; rather, it estimates the average prediction error of models fit on other unseen training sets drawn from the same population.

Abstract

Cross-validation is a widely used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather it estimates the average prediction error of models fit on other unseen training sets drawn from the same population. We further show that this phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, and Mallows' $C_p$. We also show that the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level. Because each data point is used for both training and testing, there are correlations among the measured accuracies for each fold, and so the usual estimate of variance is too small. We introduce a nested cross-validation scheme to estimate this variance more accurately, and we show empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail.
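To make the "standard" interval concrete, here is a minimal sketch of K-fold cross-validation for OLS with squared-error loss, together with the usual interval built from the per-observation errors. The function names `cv_errors` and `naive_cv_interval`, the simulated data, and the normal quantile 1.645 for a 90% interval are assumptions for illustration, not the authors' code. The standard error below treats the per-observation errors as independent, which is the assumption the abstract says fails.

```python
import numpy as np

def cv_errors(X, y, n_folds=10, seed=0):
    """Per-observation squared errors from K-fold CV with an OLS fit."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    errors = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # OLS fit on the training folds only
        beta, *_ = np.linalg.lstsq(X[train_mask], y[train_mask], rcond=None)
        # squared-error loss on the held-out fold
        errors[test_idx] = (y[test_idx] - X[test_idx] @ beta) ** 2
    return errors

def naive_cv_interval(errors, z=1.645):
    """Point estimate and naive interval that ignores correlation between errors."""
    est = errors.mean()
    se = errors.std(ddof=1) / np.sqrt(len(errors))
    return est, (est - z * se, est + z * se)

# Illustrative simulated data (hypothetical sizes, Gaussian linear model)
rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

errors = cv_errors(X, y)
est, (lo, hi) = naive_cv_interval(errors)
print(f"CV estimate: {est:.3f}, naive 90% interval: ({lo:.3f}, {hi:.3f})")
```

The nested scheme proposed in the paper replaces the standard error computed in `naive_cv_interval`; it is not reproduced here.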

Paper Structure

This paper contains 53 sections, 19 theorems, 80 equations, 26 figures, 6 tables, 1 algorithm.

Key Result

Lemma 1

When using OLS as the fitting algorithm and squared-error loss, the cross-validation estimate of prediction error, $\widehat{\hbox{Err}}^{(\textnormal{CV})}$, is linearly invariant.
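The lemma can be checked numerically. The sketch below assumes that "linearly invariant" means the estimate is unchanged when $Y$ is replaced by $Y + X\kappa$ for any $\kappa \in \mathbb{R}^p$, which is our reading of Definition 1, and that each training fold has full column rank. The function name `cv_estimate`, the fold assignment, and the simulated data are hypothetical.

```python
import numpy as np

def cv_estimate(X, y, folds):
    """K-fold CV estimate of prediction error for OLS with squared-error loss."""
    n = X.shape[0]
    errors = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        beta, *_ = np.linalg.lstsq(X[train_mask], y[train_mask], rcond=None)
        errors[test_idx] = (y[test_idx] - X[test_idx] @ beta) ** 2
    return errors.mean()

rng = np.random.default_rng(0)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Use the same fold assignment for both evaluations
folds = np.array_split(rng.permutation(n), 5)

# Shift the response by an arbitrary linear function of the features
kappa = rng.normal(size=p)
err_original = cv_estimate(X, y, folds)
err_shifted = cv_estimate(X, y + X @ kappa, folds)

# The two values should agree up to floating-point error
print(err_original, err_shifted)
```

The shift moves each fold's OLS coefficients by exactly $\kappa$, so the held-out residuals, and hence the estimate, are unchanged.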

Figures (26)

  • Figure 1: A plot of the true error of a model versus the CV estimates for 1000 replicates of the model from Section \ref{subsec:harrell_model}. The blue curve shows the average midpoint of the naïve CV confidence intervals. The green bands show the average 90% confidence interval for prediction error given by naïve CV. The red curves show the 5% and 95% quantiles from a quantile regression fit. To achieve nominal coverage, the green curves should approximate the red curves, but they are too narrow in this case.
  • Figure 2: Possible targets of inference for cross-validation. Here, $(X,Y)$ is the training data and $\hbox{Err}_{XY}$ is the average error of the model fit on $(X,Y)$ on a test data set of infinite size. From left to right, the random variables above are a constant, a function of $X$ only, and a function of $(X,Y)$.
  • Figure 3: Left: mean squared error of the CV point estimate of prediction error relative to three different estimands: $\hbox{Err}$, $\hbox{Err}_X$, and $\hbox{Err}_{XY}$. Center: coverage of $\hbox{Err}$, $\hbox{Err}_X$, and $\hbox{Err}_{XY}$ by the naïve cross-validation intervals in a homoskedastic Gaussian linear model. The nominal miscoverage rate is 10%. Each pair of points connected by a line represents 2000 replicates with the same feature matrix $X$. Right: $2000$ replicates with the same feature matrix and the line of best fit (blue).
  • Figure 4: The relationship among various notions of prediction error in the proportional asymptotic limit \ref{eq:highd_limit_def}. Recall that $\sigma^2$ is the Bayes error: the error rate of the best possible model. See Figure \ref{fig:highd_rates} for a simulation experiment demonstrating these rates. $^*$The variance of $\widehat{\hbox{Err}}$ scales as $1/\sqrt{n}$; see Section \ref{subsec:cv_bias} for details about the bias.
  • Figure 5: Simulation results demonstrating the asymptotic scaling presented in Figure \ref{fig:highd_asymptotics_viz}. The fitted slopes of the lines (after log-transforming both axes) are $0.00, -0.46, -0.50, -1.01$, from top to bottom. See Section \ref{subsec:cv_bias} for details about the rate of $\widehat{\hbox{Err}}$.
  • ...and 21 more figures

Theorems & Definitions (41)

  • Definition 1: Linearly invariant estimator
  • Lemma 1
  • Theorem 1
  • Proof of Theorem \ref{thm:ols_cv_independent_err}
  • Corollary 1
  • Remark 1
  • Theorem 2
  • Corollary 2
  • Corollary 3
  • Remark 2
  • ...and 31 more