RandALO: Out-of-sample risk estimation in no time flat
Parth Nobel, Daniel LeJeune, Emmanuel J. Candès
TL;DR
RandALO addresses the costly problem of estimating out-of-sample risk in high-dimensional settings by marrying approximate leave-one-out (ALO) with randomized diagonal estimation of the Jacobian. The paper proves asymptotic normality for the randomized diagonal estimator under elliptical sub-exponential data and introduces an MMSE-based debiasing step to handle inversion sensitivity, plus a risk-inflation correction that extrapolates across multiple values of $m$. The approach reduces the number of required Jacobian-vector products to a constant independent of problem size and yields risk estimates with substantially lower bias than $K$-fold CV at comparable or lower computational cost. Empirically, RandALO matches or surpasses CV across a wide range of linear models and real datasets, enabling fast, reliable hyperparameter tuning at scale, and it is available as an open-source Python package.
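To make the core idea concrete, the following is a minimal sketch of randomized diagonal estimation combined with the leave-one-out shortcut, written for ridge regression, where the Jacobian of the predictions with respect to the labels is the smoother matrix $H = X(X^\top X + \lambda I)^{-1}X^\top$. It is an illustration under stated assumptions, not the API of the randalo package: the function names, the synthetic data, and the choice of Rademacher probes are all hypothetical, and the sketch omits the MMSE-based debiasing and the extrapolation step.

```python
import numpy as np


def make_ridge_smoother(X, lam):
    """Return a function v -> H v, where H = X (X^T X + lam I)^{-1} X^T."""
    gram = X.T @ X + lam * np.eye(X.shape[1])

    def apply_smoother(v):
        # v may be a length-n vector or an (n, m) matrix of probe vectors.
        return X @ np.linalg.solve(gram, X.T @ v)

    return apply_smoother


def randomized_diagonal(apply_op, n, num_probes, rng):
    """Estimate diag(H) from num_probes products H w with Rademacher probes w."""
    probes = rng.choice([-1.0, 1.0], size=(n, num_probes))
    # For each probe w, w * (H w) is an unbiased estimate of diag(H).
    return np.mean(probes * apply_op(probes), axis=1)


def leave_one_out_risk(y, y_hat, diag_h):
    """Squared-error risk from the shortcut residuals (y - y_hat) / (1 - H_ii)."""
    return np.mean(((y - y_hat) / (1.0 - diag_h)) ** 2)


# Synthetic data (assumed purely for illustration).
rng = np.random.default_rng(0)
n, p, lam = 500, 100, 1.0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + 0.5 * rng.standard_normal(n)

apply_smoother = make_ridge_smoother(X, lam)
y_hat = apply_smoother(y)

# Exact diagonal: needs n products (one per column of the identity).
exact_diag = np.diag(apply_smoother(np.eye(n)))
# Randomized diagonal: needs only num_probes products.
approx_diag = randomized_diagonal(apply_smoother, n, num_probes=100, rng=rng)

risk_exact = leave_one_out_risk(y, y_hat, exact_diag)
risk_rand = leave_one_out_risk(y, y_hat, approx_diag)
print(f"risk with exact diagonal:      {risk_exact:.4f}")
print(f"risk with randomized diagonal: {risk_rand:.4f}")
```

Because the estimated diagonal enters through $1/(1 - H_{ii})$, noise in the estimate is amplified and inflates the risk estimate; this plain version does nothing to correct for that.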
Abstract
Estimating out-of-sample risk for models trained on large high-dimensional datasets is an expensive but essential part of the machine learning process, enabling practitioners to optimally tune hyperparameters. Cross-validation (CV) serves as the de facto standard for risk estimation but offers a poor trade-off between high bias ($K$-fold CV) and high computational cost (leave-one-out CV). We propose a randomized approximate leave-one-out (RandALO) risk estimator that is not only a consistent estimator of risk in high dimensions but also less computationally expensive than $K$-fold CV. We support our claims with extensive simulations on synthetic and real data and provide a user-friendly Python package implementing RandALO, available on PyPI as randalo and at https://github.com/cvxgrp/randalo.
