Cross-Validation with Antithetic Gaussian Randomization
Sifan Liu, Snigdha Panigrahi, Jake A. Soloff
TL;DR
This paper introduces a cross-validation method based on antithetic Gaussian randomization that estimates prediction error without data splitting, which is particularly useful for non-IID data. The method has two control parameters: $\alpha$, which governs the bias, and $K$, the number of train-test repetitions. The bias vanishes as $\alpha\to0$ while the variance stays bounded for fixed $K$, thanks to a carefully designed zero-sum, equicorrelated Gaussian perturbation. The approach connects to Stein's unbiased risk estimate (SURE) through a convolution-smoothed estimator and extends to exponential-family losses and GLMs, with theoretical guarantees on bias and variance and empirical validation on isotonic, logistic, and MLP regression tasks. Compared with standard cross-validation and the coupled bootstrap, the proposed method yields lower mean squared error in the main scenarios studied and offers computational advantages, especially for hyperparameter tuning. Overall, the work provides a versatile, assumption-light, and scalable framework for estimating prediction error without sample splitting, applicable to nonparametric and high-dimensional settings.
Abstract
We introduce a new cross-validation method based on an equicorrelated Gaussian randomization scheme. The method is well-suited for problems where sample splitting is infeasible, such as when data violate the assumption of independent and identical distribution. Even when sample splitting is possible, our method offers a computationally efficient alternative for estimating the prediction error, achieving comparable or even lower error than standard cross-validation in a few train-test repetitions. Drawing inspiration from recent techniques like data-fission and data-thinning, our method constructs train-test data pairs using externally generated Gaussian randomization variables. The key innovation lies in a carefully designed correlation structure among the randomization variables, which we refer to as antithetic Gaussian randomization. In theory, we show that this correlation is crucial in ensuring that the variance of our estimator remains bounded while allowing the bias to vanish. Through simulations on various data types and loss functions, we highlight the advantages of our antithetic Gaussian randomization scheme over both independent randomization and standard cross-validation, where the bias-variance tradeoff depends heavily on the number of folds.
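The zero-sum, equicorrelated structure described above can be illustrated with a minimal Python sketch. It rests on assumptions not stated in this summary: the noise vectors are built by centering independent standard normal draws across the $K$ repetitions, and the train-test pairs follow a data-fission-style construction (train $= y + \sqrt{\alpha}\,\omega$, test $= y - \omega/\sqrt{\alpha}$). The function names `antithetic_gaussian_noise` and `train_test_pairs`, and the parameters `sigma` and `seed`, are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import random


def antithetic_gaussian_noise(n, K, sigma=1.0, seed=0):
    """Return K noise vectors of length n that sum to zero coordinate-wise.

    Each coordinate is N(0, sigma^2), and any two of the K vectors have
    correlation -1/(K-1) in each coordinate (requires K >= 2).
    """
    if K < 2:
        raise ValueError("K must be at least 2")
    rng = random.Random(seed)
    # Independent standard normal draws, one row per repetition.
    z = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(K)]
    # Coordinate-wise mean across the K repetitions.
    z_bar = [sum(z[k][i] for k in range(K)) / K for i in range(n)]
    # Centering enforces the zero-sum constraint; the scale restores variance sigma^2.
    scale = sigma * math.sqrt(K / (K - 1))
    return [[scale * (z[k][i] - z_bar[i]) for i in range(n)] for k in range(K)]


def train_test_pairs(y, omegas, alpha):
    """Assumed data-fission-style split, one (train, test) pair per noise vector."""
    root = math.sqrt(alpha)
    pairs = []
    for w in omegas:
        train = [yi + root * wi for yi, wi in zip(y, w)]
        test = [yi - wi / root for yi, wi in zip(y, w)]
        pairs.append((train, test))
    return pairs


# Toy response vector; values are arbitrary.
y = [0.5, -1.2, 2.0, 0.3]
omegas = antithetic_gaussian_noise(n=len(y), K=3, sigma=1.0, seed=42)
pairs = train_test_pairs(y, omegas, alpha=0.1)
```

Because the perturbations sum to zero across repetitions, the average of the $K$ training copies in this sketch equals the original `y` exactly (up to floating-point error), which is one way to see why antithetic noise cancels in a way that independent noise does not.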
