Optimal Ridge Regularization for Out-of-Distribution Prediction

Pratik Patil; Jin-Hong Du; Ryan J. Tibshirani

TL;DR

The paper studies optimal ridge regularization for out-of-distribution (OOD) prediction. It shows that the best penalty can be negative under covariate or regression shift, and that the optimally tuned OOD risk remains monotone in the data aspect ratio and SNR. By deriving deterministic equivalents for the OOD ridge risk and introducing fixed-point quantities that capture self-induced regularization, the authors establish general alignment-based conditions that determine the sign of the optimal penalty, and they extend monotonicity results beyond the in-distribution setting. They also connect regularization to subsampling ensembles, showing when ridgeless ensembles suffice and when negative regularization is necessary to achieve the best risk. The results require only moment bounds, with no modeling assumptions on the train or test distributions, and they allow arbitrary shifts, which makes them relevant to interpolation regimes and to real-world data shifts.

Abstract

We study the behavior of optimal ridge regularization and optimal ridge risk for out-of-distribution prediction, where the test distribution deviates arbitrarily from the train distribution. We establish general conditions that determine the sign of the optimal regularization level under covariate and regression shifts. These conditions capture the alignment between the covariance and signal structures in the train and test data and reveal stark differences compared to the in-distribution setting. For example, a negative regularization level can be optimal under covariate shift or regression shift, even when the training features are isotropic or the design is underparameterized. Furthermore, we prove that the optimally-tuned risk is monotonic in the data aspect ratio, even in the out-of-distribution setting and when optimizing over negative regularization levels. In general, our results do not make any modeling assumptions for the train or the test distributions, except for moment bounds, and allow for arbitrary shifts and the widest possible range of (negative) regularization levels.
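To make the objects in the abstract concrete, here is a minimal NumPy sketch of a ridge estimator that accepts negative penalties, together with the conditional OOD prediction risk under a linear test model. The normalization $(\bm{X}^\top\bm{X}/n + \lambda\bm{I})^{-1}\bm{X}^\top\bm{y}/n$, the linear test model with parameters `beta0`, `Sigma0`, `sigma0_sq`, and the function names `ridge_fit` and `ood_risk` are assumptions made for illustration, not definitions taken from the paper.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimator (X'X/n + lam*I)^{-1} X'y/n (assumed normalization).

    lam may be negative as long as the linear system stays invertible,
    e.g. lam > -(smallest nonzero eigenvalue of X'X/n).
    """
    n, p = X.shape
    if p <= n:
        return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
    # Dual form: algebraically the same estimator, but an n x n system.
    return X.T @ np.linalg.solve(X @ X.T / n + lam * np.eye(n), y / n)

def ood_risk(beta_hat, beta0, Sigma0, sigma0_sq=0.0):
    """Conditional test risk assuming y0 = x0'beta0 + noise, Cov(x0) = Sigma0."""
    d = beta_hat - beta0
    return float(d @ Sigma0 @ d + sigma0_sq)

# Small illustration with hypothetical sizes and isotropic training features.
rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + 0.1 * rng.standard_normal(n)

s_min = np.linalg.eigvalsh(X.T @ X / n)[0]   # smallest eigenvalue (p <= n here)
for lam in (-0.5 * s_min, 0.0, 0.5):
    risk = ood_risk(ridge_fit(X, y, lam), beta, np.eye(p))
    print(f"lambda={lam:+.3f}  risk={risk:.4f}")
```

The dual branch is used when $p > n$ because $\bm{X}^\top\bm{X}/n$ is then singular, while $\bm{X}\bm{X}^\top/n$ is generically invertible and stays positive definite for mildly negative `lam`.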

Paper Structure (81 sections, 26 theorems, 166 equations, 14 figures, 6 tables)

Introduction
Summary of Results and Paper Outline
Related Work and Comparisons
Out-of-Distribution Risk Asymptotics
Data Assumptions
Out-of-Distribution Risk Asymptotics
Properties of Optimal Regularization
In-Distribution Optimal Regularization
Out-of-Distribution Optimal Regularization
Properties of Optimal Risk
Optimal Risk Monotonicity
Connection to Subsampling and Ensembling
Discussion
Organization and Notation
Organization
...and 66 more sections

Key Result

Proposition 2

Under asm:train-test, as $n,p\rightarrow\infty$ such that $p/n\rightarrow\phi\in(0,\infty)$ and $\lambda\in(\lambda_{\min}(\phi),\infty)$, the prediction risk $R(\widehat{\bm{\beta}}^\lambda)$ defined in eq:ridge_prederr admits a deterministic equivalent $R(\widehat{\bm{\beta}}^{\lambda}) \simeq \mathscr{R}(\lambda, \phi)$, with the following deterministic equivalents for the bias, variance, regression shift bias, and irreducible error: ...
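The idea of a deterministic equivalent can be illustrated numerically: as $n$ and $p$ grow with $p/n$ fixed, the risk conditional on the training data should fluctuate less and less across independent draws. The sketch below checks this in the simplest special case (no shift, isotropic Gaussian features, a fixed unit-norm signal); these choices, the normalization of the estimator, and the helper names `conditional_risk` and `risk_spread` are assumptions for illustration and do not reproduce the formulas of the proposition.

```python
import numpy as np

def conditional_risk(n, p, lam, sigma_sq, seed):
    """Risk conditional on one training draw (assumed: Sigma = Sigma_0 = I)."""
    rng = np.random.default_rng(seed)
    beta = np.ones(p) / np.sqrt(p)            # unit-norm signal (assumption)
    X = rng.standard_normal((n, p))
    y = X @ beta + np.sqrt(sigma_sq) * rng.standard_normal(n)
    beta_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
    d = beta_hat - beta
    return float(d @ d + sigma_sq)            # excess risk + irreducible error

def risk_spread(n, phi=0.5, lam=0.5, sigma_sq=1.0, n_seeds=10):
    """Mean and standard deviation of the conditional risk across draws."""
    p = int(phi * n)
    risks = np.array([conditional_risk(n, p, lam, sigma_sq, s)
                      for s in range(n_seeds)])
    return risks.mean(), risks.std()

for n in (50, 200, 800):
    mean, std = risk_spread(n)
    print(f"n={n:4d}  mean risk={mean:.3f}  std across draws={std:.4f}")
```

In runs of this kind the spread across draws should shrink as $n$ increases while the mean stabilizes, which is the finite-sample counterpart of the risk converging to a deterministic function of $\lambda$ and $\phi$.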

Figures (14)

Figure 1: Illustration of negative or positive optimal regularization under general alignment. We plot the in-distribution risk of ridge regression against the penalty $\lambda$ for varying data aspect ratios $\phi$ in the overparameterized regime. The left and right panels correspond to scenarios where the SNR is high ($\sigma^2=0.01$) and low ($\sigma^2=1$), respectively. The data model has a covariance matrix $(\bm{\Sigma}_{\mathrm{ar1}})_{ij}:= \rho_{\mathrm{ar1}}^{|i-j|}$ with parameter $\rho_{\mathrm{ar1}}=0.5$, and a coefficient $\bm{\beta}:=\frac{1}{2}(\bm{w}_{(1)} + \bm{w}_{(p)})$, where $\bm{w}_{(j)}$ is the $j$th eigenvector of $\bm{\Sigma}_{\mathrm{ar1}}$.
Figure 2: Covariate and regression shift can lead to negative optimal regularization in both underparameterized and overparameterized regimes. The plot shows the in-distribution and OOD risks against $\lambda$ in the high-SNR setting ($\sigma^2=0.01$ and $\sigma_0^2=0$). The left panel shows the overparameterized regime ($\phi=1.5$), where the optimal ridge penalty $\lambda^*$ is negative under covariate shift, when $\bm{\Sigma}=\bm{I}$, $\bm{\Sigma}_0=\bm{\Sigma}_{\mathrm{ar1}}$, and $\bm{\beta}=\bm{\beta}_0=\frac{1}{2}(\bm{w}_{(1)} + \bm{w}_{(p)})$. The right panel shows the underparameterized regime ($\phi=0.5$), where the optimal ridge penalty $\lambda^*$ is negative under regression shift, when $\bm{\Sigma}=\bm{\Sigma}_0=\bm{\Sigma}_{\mathrm{ar1}}$, $\bm{\beta}=\frac{1}{2}(\bm{w}_{(1)} + \bm{w}_{(p)})$, and $\bm{\beta}_0=2\bm{\beta}$.
Figure 3: Ridge regression optimized over $\lambda\geq\nu$ for different thresholds $\nu$ has a monotonic risk profile. We showcase the prediction risk of optimal ridge regression under the same data model as in \ref{fig:negative_optimal_our_condition}, with $\sigma^2=0.01$. The left panel shows the heatmap of the risks $\mathscr{R}(\lambda, \phi)$ of ridge regression for different ridge penalties $\lambda$ and data aspect ratios $\phi$. The lines indicate the optimized ridge risks $\min_{\lambda\geq \nu}\mathscr{R}(\lambda, \phi)$ at different thresholds $\nu$. The right panel shows the optimized risk $\min_{\lambda\geq \nu}\mathscr{R}(\lambda, \phi)$ as a function of $\phi$.
Figure 4: Effect of distribution shift on the risk monotonicity behavior of optimal ridge on MNIST. The figure illustrates the risk profile (against the training sample size) of optimal ridge regression on the MNIST dataset when subjected to different types of distribution shifts. We follow the same setup as for \ref{tab:MNIST} (see \ref{sec:real_data_illustration} for more details), vary the training sample size $n$ from 25 to 200, and inspect the OOD prediction risk of the optimal ridge predictor. Different colors represent different types of shift, from less severe (Type 1) to more severe (Type 5). The y-axis represents the out-of-distribution prediction risk for the task of predicting the digit value for unseen images. The figure shows a clear pattern in which the optimal ridge predictor exhibits a monotonically decreasing risk in the training sample size $n$.
Figure 5: Negative regularization can help achieve optimal risk in both underparameterized and overparameterized regimes. The heatmap illustrates the prediction risks for ridge regression as a function of the ridge penalty $\lambda$ and subsample aspect ratio $\psi$ in the full ensemble. We use the same data model as \ref{fig:optimal-risk} with $\sigma^2=0.01$. The left and right panels show the underparameterized ($\phi=0.5$) and overparameterized ($\phi=2$) regimes, respectively. The red paths represent the optimal risks, while the blue and green stars indicate the optimal ridge predictor and the optimal full-ensemble ridge with the largest subsample aspect ratio.
...and 9 more figures
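The data model described in the captions of Figures 1 and 2 can be set up in a few lines. The sketch below builds $\bm{\Sigma}_{\mathrm{ar1}}$ with $\rho_{\mathrm{ar1}}=0.5$, forms $\bm{\beta}=\frac{1}{2}(\bm{w}_{(1)}+\bm{w}_{(p)})$, and scans a grid of penalties (including negative ones) under the covariate-shift configuration of Figure 2's left panel. The sample sizes, the penalty grid, the estimator's normalization, and the helper names are assumptions for illustration; with finite $n$ and $p$ the empirical minimizer need not match the asymptotic picture shown in the figure.

```python
import numpy as np

def ar1_covariance(p, rho=0.5):
    """(Sigma_ar1)_{ij} = rho^{|i-j|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def figure_signal(Sigma):
    """beta = (w_(1) + w_(p)) / 2 from the extreme eigenvectors of Sigma.

    Eigenvector signs are arbitrary, so this matches the caption only up
    to the sign of each eigenvector.
    """
    _, W = np.linalg.eigh(Sigma)              # orthonormal columns
    return 0.5 * (W[:, 0] + W[:, -1])

def ridge_fit(X, y, lam):
    """Ridge in dual form (assumed normalization); intended for p > n."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T / n + lam * np.eye(n), y / n)

rng = np.random.default_rng(0)
n, p, sigma_sq = 200, 300, 0.01               # phi = p/n = 1.5 (hypothetical sizes)
Sigma0 = ar1_covariance(p)                    # test covariance; train covariance is I
beta = figure_signal(Sigma0)                  # beta = beta_0, so no regression shift
X = rng.standard_normal((n, p))
y = X @ beta + np.sqrt(sigma_sq) * rng.standard_normal(n)

# Keep the grid above -s_min so that X X'/n + lam*I stays positive definite.
s_min = np.linalg.eigvalsh(X @ X.T / n)[0]
lams = np.linspace(-0.5 * s_min, 1.0, 61)
risks = np.array([
    (ridge_fit(X, y, lam) - beta) @ Sigma0 @ (ridge_fit(X, y, lam) - beta)
    for lam in lams
])                                            # sigma_0^2 = 0 as in Figure 2
lam_best = lams[np.argmin(risks)]
print(f"grid minimizer lambda={lam_best:+.4f}  OOD risk={risks.min():.4f}")
```

Because $\bm{w}_{(1)}$ and $\bm{w}_{(p)}$ enter symmetrically, the ordering convention for eigenvalues (ascending in `np.linalg.eigh`) does not affect the construction.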

Theorems & Definitions (29)

Definition 1: General feature and response distribution
Definition 2: Lower bound on negative regularization
Proposition 2: Deterministic equivalents for OOD risk
Theorem 3: Optimal regularization sign for in-distribution risk
Proposition 3: Optimal regularization under covariate shift and random signal
Theorem 4: Optimal regularization under covariate shift and deterministic signal
Theorem 5: Optimal regularization under regression shift
Proposition 5: Optimal risk under isotropic signals
Theorem 6: Monotonicity of optimally tuned OOD risk
Theorem 7: Non-monotonicity of suboptimally tuned risk
...and 19 more
