RandALO: Out-of-sample risk estimation in no time flat

Parth Nobel; Daniel LeJeune; Emmanuel J. Candès

TL;DR

RandALO addresses the costly problem of estimating out-of-sample risk in high-dimensional settings by combining approximate leave-one-out (ALO) with randomized estimation of the diagonal of the Jacobian. It proves asymptotic normality for the randomized diagonal estimator under elliptical sub-exponential data, introduces an MMSE-based debiasing step to handle inversion sensitivity, and adds a risk-inflation correction that extrapolates across several values of $m$, the number of Jacobian--vector products. The approach reduces the number of required Jacobian--vector products to a constant independent of problem size and yields risk estimates with substantially lower bias than $K$-fold CV at comparable or lower computational cost. Empirically, RandALO matches or surpasses CV across a wide range of linear models and real datasets, enabling fast, reliable hyperparameter tuning at scale, and it is available as an open-source Python package.

Abstract

Estimating out-of-sample risk for models trained on large high-dimensional datasets is an expensive but essential part of the machine learning process, enabling practitioners to optimally tune hyperparameters. Cross-validation (CV) serves as the de facto standard for risk estimation but poorly trades off high bias ($K$-fold CV) for computational cost (leave-one-out CV). We propose a randomized approximate leave-one-out (RandALO) risk estimator that is not only a consistent estimator of risk in high dimensions but also less computationally expensive than $K$-fold CV. We support our claims with extensive simulations on synthetic and real data and provide a user-friendly Python package implementing RandALO available on PyPI as randalo and at https://github.com/cvxgrp/randalo.
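To make the leave-one-out idea concrete, here is a minimal sketch for the special case of ridge regression with squared loss, where the classical shortcut $(y_i - \hat{y}_i)/(1 - J_{ii})$ reproduces the leave-one-out residual exactly. It is an illustration written with NumPy only, not the API of the `randalo` package; the function names, the problem sizes, and the unscaled penalty $\lambda \|\beta\|_2^2$ are assumptions made for this example.

```python
import numpy as np

def ridge_loo_shortcut(X, y, lam):
    """Leave-one-out squared error from a single ridge fit.

    Assumes the objective ||y - X b||^2 + lam * ||b||^2 with no intercept.
    """
    n, p = X.shape
    A = X.T @ X + lam * np.eye(p)
    J = X @ np.linalg.solve(A, X.T)        # Jacobian of predictions w.r.t. y
    resid = y - J @ y                      # in-sample residuals
    loo_resid = resid / (1.0 - np.diag(J)) # leave-one-out correction
    return np.mean(loo_resid ** 2)

def ridge_loo_brute_force(X, y, lam):
    """Reference value: refit the model n times, leaving one point out."""
    n, p = X.shape
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        Xi, yi = X[mask], y[mask]
        beta = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
        errs[i] = (y[i] - X[i] @ beta) ** 2
    return errs.mean()

# Hypothetical synthetic problem, chosen only for illustration.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 40))
y = X @ rng.standard_normal(40) + rng.standard_normal(60)

shortcut = ridge_loo_shortcut(X, y, lam=1.0)
brute = ridge_loo_brute_force(X, y, lam=1.0)
```

The shortcut needs only the diagonal of $\mathbf{J}$, which is why estimating that diagonal cheaply is the computational crux for large problems.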

Paper Structure (28 sections, 9 theorems, 57 equations, 10 figures, 1 table, 1 algorithm)

Introduction
Contributions.
Related work
Approximate leave-one-out for linear models
Randomized approximate leave-one-out
Dealing with noise: Inversion sensitivity
Dealing with noise: Risk inflation debiasing
Computing Jacobian--vector products
Alternative approaches
Numerical Experiments
RandALO implementation.
Jacobian--vector product implementation.
Machine learning implementation.
Hyperparameter selection.
Risk metrics.
...and 13 more sections

Key Result

Theorem 1

Let $\widetilde{\mathbf{J}} = \mathbf{X} \left( \mathbf{X}^\top \mathbf{X} + \mathbf{G} \right)^{-1} \mathbf{X}^\top$ for $\mathbf{X} = \mathbf{T}^{1/2} \mathbf{Z} {\bm{\Sigma}}^{1/2}$, where $\mathbf{T} \in \mathbb{R}^{n \times n}$ is a diagonal matrix with positive diagonal elements $t_i$ in a fin[…] where $\eta = \mathrm{tr} [{\bm{\Sigma}} \left( \mathbf{X}^\top \mathbf{X} + \mathbf{G} \right)^{-1} \cdots$ […]
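The randomized diagonal estimator that this theorem characterizes can be sketched in a few lines. The example below averages $(\widetilde{\mathbf{J}} \mathbf{w}_k) \odot \mathbf{w}_k$ over $m$ random probe vectors, using only Jacobian--vector products. Several choices here are assumptions for illustration and are not taken from the source: Rademacher ($\pm 1$) probes, $\mathbf{G} = \lambda \mathbf{I}$, plain Gaussian data with $\mathbf{T} = \mathbf{I}$ and ${\bm{\Sigma}} = \mathbf{I}$, and the helper names `make_jvp` and `bks_diagonal`.

```python
import numpy as np

def make_jvp(X, G):
    """Return a function W -> J W for J = X (X^T X + G)^{-1} X^T.

    J itself is never formed; each product costs one linear solve.
    """
    A = X.T @ X + G
    def jvp(W):
        return X @ np.linalg.solve(A, X.T @ W)
    return jvp

def bks_diagonal(jvp, n, m, rng):
    """Estimate diag(J) from m Jacobian-vector products.

    Assumes Rademacher probes, so w_i^2 = 1 and a plain average is unbiased.
    """
    W = rng.choice([-1.0, 1.0], size=(n, m))  # one probe vector per column
    return np.mean(jvp(W) * W, axis=1)        # mu_i = mean_k (J w_k)_i * w_ki

# Hypothetical small problem, sized like the n = 200, p = 150 example.
rng = np.random.default_rng(0)
n, p, lam = 200, 150, 1.0
X = rng.standard_normal((n, p))
jvp = make_jvp(X, lam * np.eye(p))

exact_diag = np.diag(jvp(np.eye(n)))          # feasible only because n is small
mu = bks_diagonal(jvp, n, m=2000, rng=rng)
max_err = np.max(np.abs(mu - exact_diag))
```

A large $m$ is used here only so that the estimate can be compared against the exact diagonal; the noise in each $\mu_i$ shrinks as $m$ grows.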

Figures (10)

Figure 1: $K$-fold cross-validation (CV, solid blue, circles) provides a poor trade-off between risk estimation error and computational time on a high-dimensional lasso problem. Meanwhile, BKS-ALO (dashed orange, squares), a simplified version of our method, dominates CV in estimation bias and computational cost. Our fully debiased procedure RandALO (dash--dot green, triangles) goes further and reduces bias by an order of magnitude for the same computational cost, and both methods reach the same bias as exact ALO (red diamond) in a fraction of the time. Lines denote mean risk estimate bias and time over 100 trials. The relative risk estimation bias is computed as $|\hat{R} - R| / R$ for a particular mean risk estimate $\hat{R}$, where the true risk $R$ is estimated as the sample mean of the conditional risks given the training data. The $y$-axis is logarithmic above the true conditional risk standard error of $0.122\%$ (dotted, black) and linear below.
Figure 2: Theorem 1 provides a very accurate characterization of randomized diagonal estimation even for a fairly small problem with $n = 200$ and $p = 150$. Left: The empirical distribution of $\mu_i$ for $m=10$ over 1000 trials (with randomness only over the vectors $\mathbf{w}_k$) exactly matches the mixture of Gaussians centered at each $\tilde{J}_{ii}$ predicted by Theorem 1. Middle: Taking the $z$-scores of the individual Jacobian--vector products $(\widetilde{\mathbf{J}} \mathbf{w}_k) \odot \mathbf{w}_k$ from the same experiment, the empirical distribution is well described by the standard normal. Right: Looking at the $z$-scores for a single Jacobian--vector product, the pairs of successive elements of the resulting vector are uncorrelated as predicted by the asymptotics.
Figure 3: Left: Minimum mean squared error (MMSE) estimation using a uniform prior on $\tilde{J}_{ii}$ (orange) provides a much better estimate than the naïve maximum likelihood estimate (MLE) $\mu_i$ (blue), and is meaningful even when $\mu_i \notin [0, 1]$. Right: We plug in our diagonal estimates into the formula $\tilde{J}_{ii}/(1 - \tilde{J}_{ii})$ for a ridge regression problem with $n = p = 100$, $\lambda = 0.1$, and $\mathbf{x}_i = t_i \mathbf{z}_i$ for $t_i \sim \mathrm{Uniform}[\tfrac{1}{2}, 1]$ and $\mathbf{z}_i \sim \mathcal{N}({\bf 0}, \mathbf{I})$. Direct application of $\mu_i$ for $m = 50$ provides poor estimates when $\tilde{J}_{ii}$ is close to 1 (blue circles), and sometimes yields nonsense results when $\mu_i > 1$ (red diamonds) which are poorly addressed by clipping. Meanwhile, the truncated normal MMSE strategy (orange $\times$'s) controls the effect of noise on inversion.
Figure 4: Left: Although the plug-in estimation of ALO using BKS diagonal estimation (blue, dashed) is significantly biased, by evaluating the plug-in BKS risk estimate with subsampled Jacobian--vector products (dots), we can obtain high quality debiased estimates (triangle, star) of ALO using a linear regression (solid lines). Right: Our complete procedure RandALO (orange, solid) converges very quickly in $m$ to the limiting ALO risk estimate, which provides an accurate estimate of test error (black, dotted). It converges significantly more quickly than the naïve plug-in BKS estimate (blue, dashed). Lines and shaded areas denote mean and standard deviation over 100 random trials of a lasso problem with $n = p = 5000$ described in the lasso experiments of the Numerical Experiments section.
Figure 5: Left: For a lasso problem in proportionally high dimensions $p = n$, CV suffers from bias that does not vanish with $n$ even as risk concentrates. Meanwhile, even BKS-ALO with its biased risk estimate at only $m = 50$ Jacobian--vector products is more accurate than CV at lower computational cost (right). Going further, RandALO removes the bias for the same choice of $m$ with a computational overhead that vanishes as $n$ increases.
...and 5 more figures
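The MMSE idea in the Figure 3 caption can be illustrated with a short sketch. If a noisy estimate $\mu_i$ is modeled as Gaussian around $\tilde{J}_{ii}$ and $\tilde{J}_{ii}$ is given a uniform prior on $[0, 1]$, the posterior is a normal distribution truncated to $[0, 1]$, and its mean is the MMSE estimate. The code below computes that posterior mean with the standard truncated-normal formula. It assumes the noise standard deviation `sigma` is known, which is a simplification for this example, and it estimates $\tilde{J}_{ii}$ itself; how the paper obtains the noise level and which quantity it applies the estimate to are not stated in this summary. The function name is hypothetical.

```python
import math

def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mmse_uniform_prior(mu, sigma):
    """Posterior mean of J in [0, 1] given mu ~ N(J, sigma^2), uniform prior.

    This is the mean of a normal(mu, sigma) truncated to [0, 1]. The simple
    formula can lose precision when mu lies many sigmas outside [0, 1].
    """
    a = (0.0 - mu) / sigma
    b = (1.0 - mu) / sigma
    mass = _cdf(b) - _cdf(a)              # posterior normalizing constant
    return mu + sigma * (_pdf(a) - _pdf(b)) / mass

# Hypothetical noisy diagonal estimates, including values outside [0, 1].
inside = mmse_uniform_prior(0.5, 0.1)
above = mmse_uniform_prior(1.2, 0.1)
below = mmse_uniform_prior(-0.3, 0.2)

# Plugging into J / (1 - J) now stays finite and positive.
ratio_above = above / (1.0 - above)
```

Unlike clipping $\mu_i$ to $[0, 1]$, the posterior mean never lands exactly on the boundary, so the ratio $\tilde{J}_{ii}/(1 - \tilde{J}_{ii})$ remains well defined.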

Theorems & Definitions (17)

Theorem 1
proof : Proof sketch
Theorem 2
proof
Corollary 3
proof
Corollary 4
proof
Corollary 5
proof
...and 7 more
