Asymptotically free sketched ridge ensembles: Risks, cross-validation, and tuning

Pratik Patil; Daniel LeJeune

TL;DR

This work employs random matrix theory to establish the consistency of generalized cross validation (GCV) for estimating prediction risks of sketched ridge regression ensembles, enabling efficient and consistent tuning of regularization and sketching parameters. It also proposes an "ensemble trick" whereby the risk of unsketched ridge regression can be efficiently estimated via GCV using small sketched ridge ensembles.

Abstract

We employ random matrix theory to establish consistency of generalized cross validation (GCV) for estimating prediction risks of sketched ridge regression ensembles, enabling efficient and consistent tuning of regularization and sketching parameters. Our results hold for a broad class of asymptotically free sketches under very mild data assumptions. For squared prediction risk, we provide a decomposition into an unsketched equivalent implicit ridge bias and a sketching-based variance, and prove that the risk can be globally optimized by only tuning sketch size in infinite ensembles. For general subquadratic prediction risk functionals, we extend GCV to construct consistent risk estimators, and thereby obtain distributional convergence of the GCV-corrected predictions in Wasserstein-2 metric. This in particular allows construction of prediction intervals with asymptotically correct coverage conditional on the training data. We also propose an "ensemble trick" whereby the risk for unsketched ridge regression can be efficiently estimated via GCV using small sketched ridge ensembles. We empirically validate our theoretical results using both synthetic and real large-scale datasets with practical sketches including CountSketch and subsampled randomized discrete cosine transforms.
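As a rough illustration of the setup described in the abstract, the following is a minimal sketch of a feature-sketched ridge ensemble together with a GCV-style risk estimate. Several things here are assumptions rather than details taken from the paper: the function name `sketched_ridge_ensemble_gcv` is hypothetical, the sketch is i.i.d. Gaussian with variance $1/q$ (the abstract's practical examples are CountSketch and subsampled randomized discrete cosine transforms), and the estimator is the classical GCV form (training error divided by $(1 - \mathrm{tr}[\mathbf{L}]/n)^2$) applied to the averaged smoothing matrix of the ensemble.

```python
import numpy as np

def sketched_ridge_ensemble_gcv(X, y, lam, q, K, rng):
    """Average K feature-sketched ridge fits and return a GCV-style risk estimate.

    Assumptions (not from the source): Gaussian sketches with N(0, 1/q) entries,
    and the classical GCV correction applied to the averaged smoother.
    """
    n, p = X.shape
    beta_ens = np.zeros(p)
    trace_ens = 0.0  # trace of the averaged smoothing matrix
    for _ in range(K):
        S = rng.standard_normal((p, q)) / np.sqrt(q)  # p x q sketching matrix
        XS = X @ S                                    # n x q sketched features
        G = XS.T @ XS / n + lam * np.eye(q)           # regularized sketched Gram matrix
        A = np.linalg.solve(G, XS.T) / n              # q x n, maps y to sketched coefficients
        beta_ens += S @ (A @ y) / K                   # lift coefficients back to p dimensions
        trace_ens += np.sum(XS * A.T) / K             # tr(XS @ A) without forming an n x n matrix
    train_err = np.mean((y - X @ beta_ens) ** 2)
    gcv = train_err / (1.0 - trace_ens / n) ** 2
    return beta_ens, gcv

# Small synthetic usage example (all sizes are arbitrary choices).
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta0 + 0.5 * rng.standard_normal(n)
beta_hat, gcv = sketched_ridge_ensemble_gcv(X, y, lam=0.1, q=20, K=5, rng=rng)
```

In this form, tuning amounts to evaluating `gcv` over a grid of `lam` and `q` values and keeping the minimizer, with no held-out data required.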

Paper Structure (56 sections, 16 theorems, 114 equations, 7 figures, 5 tables)

Introduction
Summary of results and outline
Related work
Sketched ensembles
Sketched ensembles and risk functionals
Proposed GCV plug-in estimators
Squared risk asymptotics and consistency
Asymptotically free sketching
Asymptotic decompositions and consistency
General functional consistency
Tuning applications and theoretical implications
Discussion
Background on asymptotic freeness and free sketching support
Free probability theory
Asymptotic freeness
...and 41 more sections

Key Result

Theorem 1

Under cond:sketch, for all $\lambda > \lambda_0$, where ${\mu > -\lambda_{\min}^{+}(\widehat{{\bm{\Sigma}}})}$ is increasing in $\lambda > \lambda_0$ and satisfies

Figures (7)

Figure 1: GCV provides consistent risk estimation for sketched ridge regression. We show squared risk (solid) and GCV estimates (dashed) for sketched regression ensembles of $K = 5$ predictors on synthetic data with $n = 500$ observations and $p = 600$ features. Left: Each sketch induces its own risk curve in regularization strength $\lambda$, but across all sketches GCV is consistent. Middle: Minimizers and minimum values can vary by sketching type. Right: Each sketch also induces a risk curve in sketch size $\alpha = q/p$, so sketch size can be tuned to optimize risk. Error bars denote standard error of the mean over 100 trials. Here, SRDCT refers to a subsampled randomized discrete cosine transform (see the experimental details section, sec:experimental-details).
Figure 2: GCV provides very accurate risk estimates for real-world data. We fit ridge regression ensembles of size $K = 5$ using CountSketch (charikar2002frequent) on binary $\pm 1$ labels from RCV1 (lewis2004rcv1) ($n = 20000$, $p = 30617$, $q = 515$) (left) and RNA-Seq (weinstein2013tcga) ($n = 356$, $p = 20223$, $q = 99$) (right). GCV (dashed, circles) matches test risk (solid, diamonds) and improves upon 2-fold CV (dotted) for both squared error (blue, green) and classification error (orange, red). CV provides poorer estimates for less positive $\lambda$, an effect that is heavily exaggerated when $n$ is small, as in RNA-Seq. Error bars denote standard deviation over 10 trials.
Figure 3: GCV provides consistent prediction intervals and distribution estimates. Left: We construct GCV prediction intervals for SRDCT ensembles of size $K = 5$ fit to synthetic data ($n = 1500$, $p=1000$) with nonlinear responses $y = \mathrm{soft\,threshold}(\mathbf{x}^\top {\bm{\beta}}_0)$. Mid-left: We use GCV to tune our model to optimize prediction interval width. Right: The empirical GCV estimate ${\widehat{P}}_\lambda^\mathrm{ens}$ in the equation labeled eq:gcv-dual-empirical-dist (here for $\alpha=0.68$) closely matches the true joint response--prediction distribution $P_\lambda^\mathrm{ens}$. Error bars denote standard deviation over 30 trials.
Figure 4: GCV combined with sketching yields a fast method for tuning ridge. We fit SRDCT ensembles on synthetic data ($n = 600$, $p = 800$), sketching features (left and right) or observations (middle). GCV (dashed) provides consistent estimates of test risk (solid) for feature sketching but not for observation sketching. However, the ensemble trick in the equation labeled eq:ensemble-trick does not depend on the variance and thus works for both. For $\lambda = 0$, each equivalent $\mu > 0$ can be achieved by an appropriate choice of $\alpha$. Error bars denote standard deviation over 50 trials.
Figure 5: Empirical support for asymptotic freeness and subordination relation. Left: We plot the absolute value of the average of the normalized traces of polynomials, which converge to zero. We also plot best fit lines on the log--log scale (dashed). Error bars denote one standard deviation over 10 trials, collected over both polynomials. Right: We numerically compute $\mu$ and plot the empirical subordination relation, which are decreasing continuous functions that closely match the theoretical S-transforms of Gaussian (dashed) for CountSketch ($\times$) and orthogonal (dash--dot) for SRDCT ($\circ$). Each mark in the scatter plots corresponds to a single $(\mathbf{A}, \lambda)$ pair, and we solve for the corresponding $\mu$.
...and 2 more figures

Theorems & Definitions (29)

Theorem 1: Free sketching equivalence; lejeune2022asymptotics, Theorem 7.2
Theorem 2: Risk and GCV asymptotics
Theorem 3: GCV consistency
Theorem 4: Functional consistency
Corollary 5: Distributional consistency
Proposition 6: Optimized GCV versus optimized ridge
Proposition 7: GCV inconsistency for observation sketch
Definition 8: $C^*$-probability space and state
Definition 9: Freeness
Definition 10: Convergence in spectral distribution
...and 19 more
