Cross-Validation with Antithetic Gaussian Randomization

Sifan Liu, Snigdha Panigrahi, Jake A. Soloff

TL;DR

This paper introduces a cross-validation method based on antithetic Gaussian randomization that removes the need for data splitting when estimating prediction error, which is particularly useful for non-IID data. The method has two control parameters: $\alpha$, which governs the bias, and $K$, the number of train-test repetitions. The authors show that the bias vanishes as $\alpha\to0$ while the variance stays bounded for fixed $K$, thanks to a carefully designed zero-sum, equicorrelated Gaussian perturbation. The approach connects to SURE via a convolution-smoothed estimator and extends to exponential-family losses and GLMs, with theoretical guarantees on bias and variance and empirical validation on isotonic, logistic, and MLP regression tasks. Compared with standard cross-validation and the coupled bootstrap, the proposed method yields lower mean squared error in the reported experiments and offers computational advantages, especially for hyperparameter tuning. Overall, the work provides a versatile, assumption-light, and scalable framework for estimating prediction error without sample splitting, with broad applicability to nonparametric and high-dimensional settings.

Abstract

We introduce a new cross-validation method based on an equicorrelated Gaussian randomization scheme. The method is well-suited for problems where sample splitting is infeasible, such as when data violate the assumption of independent and identical distribution. Even when sample splitting is possible, our method offers a computationally efficient alternative for estimating the prediction error, achieving comparable or even lower error than standard cross-validation in a few train-test repetitions. Drawing inspiration from recent techniques like data-fission and data-thinning, our method constructs train-test data pairs using externally generated Gaussian randomization variables. The key innovation lies in a carefully designed correlation structure among the randomization variables, which we refer to as antithetic Gaussian randomization. In theory, we show that this correlation is crucial in ensuring that the variance of our estimator remains bounded while allowing the bias to vanish. Through simulations on various data types and loss functions, we highlight the advantages of our antithetic Gaussian randomization scheme over both independent randomization and standard cross-validation, where the bias-variance tradeoff depends heavily on the number of folds.
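
The zero-sum, equicorrelated structure of the randomization variables can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `antithetic_gaussian`, its arguments, and the centering-and-rescaling construction are assumptions chosen for illustration. The sketch draws $K$ independent Gaussian vectors, subtracts their mean across repetitions, and rescales so that each vector keeps variance $\sigma^2$ per coordinate; under this construction any two vectors have correlation $-1/(K-1)$ and the $K$ vectors sum to zero.

```python
import numpy as np


def antithetic_gaussian(n, K, sigma=1.0, rng=None):
    """Hypothetical helper: draw K zero-sum, equicorrelated Gaussian vectors.

    Returns an array of shape (K, n). Each row is marginally N(0, sigma^2 I_n),
    any two distinct rows have correlation -1/(K-1), and the rows sum to zero.
    """
    if K < 2:
        raise ValueError("K must be at least 2")
    rng = np.random.default_rng(rng)
    # Independent Gaussian draws, one row per repetition.
    z = rng.normal(scale=sigma, size=(K, n))
    # Centering across repetitions forces the rows to sum to zero;
    # the factor sqrt(K / (K - 1)) restores the marginal variance sigma^2.
    return np.sqrt(K / (K - 1)) * (z - z.mean(axis=0, keepdims=True))


omega = antithetic_gaussian(n=5, K=4, rng=0)
print(omega.sum(axis=0))  # numerically zero in every coordinate
```

With independent draws the correlation between repetitions would be zero; the negative correlation here is what the abstract refers to as the antithetic design.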

Paper Structure

This paper contains 30 sections, 15 theorems, 135 equations, 5 figures.

Key Result

Lemma 3.1

Let $f$ be an integrable function under the Gaussian distribution $\mathcal{N}(\theta, \sigma^2 I_n)$. Then

Figures (5)

  • Figure 1: Mean squared error (MSE) for estimating prediction error in an isotonic regression problem using a simulated dataset. From left to right, the methods shown are classic 2-fold CV, LOO CV, and the proposed method with $K=2$ and $\alpha=0.01$. Additional details are provided in the experiments section.
  • Figure 2: In the isotonic regression simulations, the data are generated based on the function $f^*$, shown in solid line. An example of a simulated dataset is displayed as scatter points.
  • Figure 3: Mean squared error of standard $K$-fold CV, coupled bootstrap with $K$ repetitions, and the proposed Antithetic CV methods with $K$ repetitions. Left panel: MSE with $\alpha=0.1$ and varying $K$. Right panel: MSE with $K=8$ and varying $\alpha$.
  • Figure 4: MSE in estimating the prediction error in the logistic regression example. We set $\alpha=0.1$ and consider $K=10$ and 20.
  • Figure 5: MSE in estimating the prediction error in the MLP regression example. We consider $K=10$ and 20. For CB and antithetic CV, $\alpha$ is set to be 0.1.

Theorems & Definitions (34)

  • Lemma 3.1: Approximation to the identity
  • proof : Proof of Lemma 3.1
  • Theorem 3.2: Bias
  • proof : Proof of Theorem 3.2
  • Theorem 3.3: Reducible variance
  • Remark 3.1
  • proof : Proof sketch of Theorem 3.3
  • Theorem 3.4: Irreducible variance
  • Corollary 1
  • Proposition 4.1: Connection with SURE
  • ...and 24 more