Optimal Ridge Regularization for Out-of-Distribution Prediction
Pratik Patil, Jin-Hong Du, Ryan J. Tibshirani
TL;DR
The paper studies optimal ridge regularization for out-of-distribution (OOD) prediction. It shows that the best penalty can be negative under covariate or regression shift, and that the optimally tuned OOD risk remains monotone in the data aspect ratio and the signal-to-noise ratio (SNR). By deriving deterministic equivalents for the OOD ridge risk and introducing fixed-point quantities that capture self-induced regularization, the authors establish general alignment-based conditions that determine the sign of the optimal penalty, and they extend monotonicity results beyond the in-distribution setting. They also connect regularization to subsampling ensembles, showing when ridgeless ensembles suffice and when negative regularization is necessary to achieve the best risk. The results require only moment bounds on the train and test distributions, with no further modeling assumptions, and they allow arbitrary shifts, which makes them relevant to interpolation regimes and to data shifts encountered in practice.
Abstract
We study the behavior of optimal ridge regularization and optimal ridge risk for out-of-distribution prediction, where the test distribution deviates arbitrarily from the train distribution. We establish general conditions that determine the sign of the optimal regularization level under covariate and regression shifts. These conditions capture the alignment between the covariance and signal structures in the train and test data and reveal stark differences compared to the in-distribution setting. For example, a negative regularization level can be optimal under covariate shift or regression shift, even when the training features are isotropic or the design is underparameterized. Furthermore, we prove that the optimally-tuned risk is monotonic in the data aspect ratio, even in the out-of-distribution setting and when optimizing over negative regularization levels. In general, our results do not make any modeling assumptions for the train or the test distributions, except for moment bounds, and allow for arbitrary shifts and the widest possible range of (negative) regularization levels.
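The setting in the abstract can be explored numerically with a small finite-sample sketch. The code below is not the authors' method or code: it simply evaluates the exact excess test risk of a ridge estimator over a grid of penalties that includes negative values. All names (`ridge_fit`, `ood_excess_risk`, `risk_curve`), the normalization of the penalty by the sample size, the restriction to the underparameterized case (more samples than features), and the simulated data are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate (X^T X / n + lam * I)^{-1} X^T y / n; lam may be negative."""
    n, p = X.shape
    gram = X.T @ X / n
    # A negative penalty is usable only while the regularized Gram matrix
    # stays positive definite, i.e. lam > -(smallest eigenvalue of gram).
    min_eig = np.linalg.eigvalsh(gram)[0]
    if lam <= -min_eig:
        raise ValueError("lam must exceed minus the smallest Gram eigenvalue")
    return np.linalg.solve(gram + lam * np.eye(p), X.T @ y / n)

def ood_excess_risk(beta_hat, beta_test, sigma_test):
    """Excess squared prediction risk under the test covariance sigma_test.

    The test noise variance is omitted because it does not depend on lam.
    """
    diff = beta_hat - beta_test
    return float(diff @ sigma_test @ diff)

def risk_curve(X, y, beta_test, sigma_test, lams):
    """Excess OOD risk of the ridge estimator at each penalty in lams."""
    return np.array(
        [ood_excess_risk(ridge_fit(X, y, lam), beta_test, sigma_test) for lam in lams]
    )

# Hypothetical simulation: isotropic training features, anisotropic test covariance.
rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.standard_normal((n, p))
beta_train = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_train + 0.5 * rng.standard_normal(n)

sigma_test = np.diag(np.linspace(0.1, 3.0, p))  # covariate shift (assumed form)
beta_test = beta_train                          # set differently to model regression shift

min_eig = np.linalg.eigvalsh(X.T @ X / n)[0]
lams = np.linspace(-0.9 * min_eig, 2.0, 60)     # grid includes negative penalties
risks = risk_curve(X, y, beta_test, sigma_test, lams)
print("penalty with smallest excess OOD risk on this draw:", lams[np.argmin(risks)])
```

Whether the minimizing penalty is negative on a given draw depends on the alignment between the signal and the train and test covariances, which is the quantity the paper's conditions characterize; a single simulation like this one only illustrates the setup and does not verify the theory.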
