Risk and cross validation in ridge regression with correlated samples

Alexander Atanasov; Jacob A. Zavatone-Veth; Cengiz Pehlevan

TL;DR

Existing high-dimensional theory for ridge regression typically assumes independent training samples. Using random matrix theory and free probability, this paper derives sharp deterministic equivalents for the in- and out-of-sample risks when the samples are arbitrarily correlated. It shows that the standard GCV estimator fails under such correlations and introduces CorrGCV, an efficiently computable, unbiased estimator that concentrates in the high-dimensional limit when the noise has the same correlations as the covariates. The analysis extends to test points that are correlated with the training set (as in time-series forecasting) and to covariate shift, with precise bias-variance decompositions and an algorithmic recipe for computing CorrGCV. Experiments on a variety of correlated data agree with the theory and show improved out-of-sample risk estimation and ridge parameter selection.

Abstract

Recent years have seen substantial advances in our understanding of high-dimensional ridge regression, but existing theories assume that training examples are independent. By leveraging techniques from random matrix theory and free probability, we provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We demonstrate that in this setting, the generalized cross validation estimator (GCV) fails to correctly predict the out-of-sample risk. However, in the case where the noise residuals have the same correlations as the data points, one can modify the GCV to yield an efficiently-computable unbiased estimator that concentrates in the high-dimensional limit, which we dub CorrGCV. We further extend our asymptotic analysis to the case where the test point has nontrivial correlations with the training set, a setting often encountered in time series forecasting. Assuming knowledge of the correlation structure of the time series, this again yields an extension of the GCV estimator, and sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. We validate the predictions of our theory across a variety of high dimensional data.
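For orientation, the classical GCV estimator that the abstract refers to can be sketched in a few lines of NumPy. This is the standard textbook form for independent samples, not the paper's CorrGCV; the function names `ridge_fit` and `gcv` and the convention of scaling the penalty as $\lambda \|w\|^2$ against a mean squared loss are assumptions made for this illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge weights minimizing (1/T)||y - Xw||^2 + lam * ||w||^2 (assumed convention)."""
    T, N = X.shape
    return np.linalg.solve(X.T @ X / T + lam * np.eye(N), X.T @ y / T)

def gcv(X, y, lam):
    """Classical GCV: training error divided by (1 - df/T)^2."""
    T, _ = X.shape
    w = ridge_fit(X, y, lam)
    train_err = np.mean((y - X @ w) ** 2)
    # Degrees of freedom: trace of the hat matrix X (X^T X / T + lam I)^{-1} X^T / T,
    # computed from the eigenvalues of the empirical covariance X^T X / T.
    eig = np.linalg.svd(X, compute_uv=False) ** 2 / T
    df = np.sum(eig / (eig + lam))
    return train_err / (1.0 - df / T) ** 2
```

Evaluating `gcv(X, y, lam)` over a grid of `lam` values and taking the minimizer is the usual way to tune the ridge parameter; the paper's claim is that this estimate becomes unreliable once the rows of `X` are correlated.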

Paper Structure (49 sections, 14 theorems, 212 equations, 20 figures)

Setup and notation
Deterministic equivalences
Weak deterministic equivalents
One-point strong deterministic equivalents
Two-point strong deterministic equivalents
Predicting an uncorrelated test set
Warm-up: Linear regression without correlations
Correlated data with identically correlated noise
Mismatched correlations and OOD generalization
Effect on double descent
Algorithmic implementation
Testing on correlated data
Conclusion
Algorithmic implementation of the CorrGCV
Code for implementation
...and 34 more sections

Key Result

Lemma 2.2

Let $\kappa$ and $\tilde{\kappa}$ be as in \eqref{eq:weak_det_equiv_1}. Then,

Figures (20)

Figure 1: Empirical risk $\hat{R}$, out-of-sample risk $R$, and fine-grained bias-variance decompositions for ridge regression with structured features and correlated examples. Theory is plotted in solid lines. Experiments with error bars over 10 dataset repetitions are plotted as markers. The data points are exponentially correlated as $\mathbb E[\bm x_t \cdot \bm x_s] \propto e^{-|t - s|/\xi}$. Left: Weak correlations, $\xi = 10^{-2}$. Here, the generalized cross validation method (orchid) as well as its other proposed corrections in the presence of correlations (pink, purple) all agree and are overlaid. Right: Strong correlations, $\xi = 10^{2}$. Here, we see that the naive estimates of the GCV proposed in prior works fail in this setting. They either underestimate (purple) or overestimate (pink) the out-of-sample risk. We define the naive GCVs in the text, and connect them with prior proposals in Appendix \ref{app:previous_estimators}. By contrast, our proposed estimator, CorrGCV, correctly predicts the out-of-sample risk in all settings.
Figure 2: Estimating the optimal ridge parameter for exponential correlations using the CorrGCV. The setup here is as in Figure \ref{fig:motivation}. We see that only the CorrGCV accurately predicts the out-of-sample risk, and thus is the only estimator that allows one to correctly pinpoint the optimal ridge parameter $\lambda$.
Figure 3: Power law scalings for data with a) exponential correlations with $\xi = 10^{2}$ and b) power law correlations $\mathbb E \bm x_{t}^{\top} \bm x_{t+\tau} \propto \tau^{-\chi}$ with $\chi = 0.3$. In both cases, the correlations of the data do not affect the scaling of the generalization error as a function of $T$, which generally goes as $T^{-2 \alpha \min(r,1)}$, as derived in prior works. Although other estimators correctly predict the rate of decay, only the CorrGCV correctly recovers the exact risk.
Figure 4: Precise asymptotics for double-descent in linear regression with unstructured data across various correlations. We choose an exponential correlation with correlation length $\xi$ and vary $\xi$. a) Weakly correlated data and noise, giving rise to the traditional double descent curve as analyzed in \cite{advani2020high, hastie2022surprises}. All GCV-related estimators agree and correctly estimate the out-of-sample risk. b) Strongly correlated data with matched noise correlations. The double descent peak is mollified. c) Strongly correlated data but uncorrelated noise. The double descent peak is exacerbated. This mismatch in correlations violates the assumptions of the CorrGCV, and thus no GCV can asymptotically match it without knowledge of the noise level $\sigma_\epsilon$. Across all settings the theory curves (solid lines) find excellent agreement with the experiments (solid markers with error bars over 10 different datasets).
Figure 5: A graphical representation of the program needed to obtain the CorrGCV empirically from a given dataset. The asymmetry of the diagram arises from the fact that we estimate $\kappa$ first rather than $\tilde{\kappa}$. This is because it is more reasonable to assume a good estimate of the correlations $\bm K$, which often have properties such as stationarity that improve the estimation process, compared to estimating $\bm \Sigma$. As a result, it is easier to use either an exact form or a differentiable interpolation of $S_{\bm K}(\mathrm{df})$ rather than $S_{\bm \Sigma}$.
...and 15 more figures
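
The exponentially correlated samples described in the captions of Figures 1 and 4, with $\mathbb E[\bm x_t \cdot \bm x_s] \propto e^{-|t - s|/\xi}$, can be simulated with a short NumPy sketch. It assumes Gaussian data with isotropic (unstructured) features, and the helper names `exp_corr_matrix` and `sample_correlated` are hypothetical; the paper's own experimental code may differ.

```python
import numpy as np

def exp_corr_matrix(T, xi):
    """Sample-sample correlation matrix K with K[t, s] = exp(-|t - s| / xi)."""
    t = np.arange(T)
    return np.exp(-np.abs(t[:, None] - t[None, :]) / xi)

def sample_correlated(T, N, xi, rng):
    """Draw a T x N Gaussian design whose rows are correlated according to K.

    Features are taken to be isotropic here (an assumption for illustration),
    so that E[x_t . x_s] = N * K[t, s].
    """
    K = exp_corr_matrix(T, xi)
    L = np.linalg.cholesky(K)         # K = L L^T
    Z = rng.standard_normal((T, N))   # i.i.d. standard normal entries
    return L @ Z, K
```

Setting `xi = 1e-2` gives a `K` that is numerically the identity (the weak-correlation regime), while `xi = 1e2` gives the strongly correlated regime discussed in the captions.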

Theorems & Definitions (29)

Definition 2.1: Strong deterministic equivalence
Lemma 2.2
proof
Lemma 2.3
proof
Lemma 2.4
proof
Theorem 3.1
proof
Theorem 3.2
...and 19 more
