Optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization

Dmitry Kobak, Jonathan Lomond, Benoit Sanchez

Abstract

Conventional wisdom in statistical learning holds that large models require strong regularization to prevent overfitting. Here we show that this rule can be violated by linear regression in the underdetermined $n\ll p$ situation under realistic conditions. Using simulations and real-life high-dimensional data sets, we demonstrate that an explicit positive ridge penalty can fail to provide any improvement over the minimum-norm least squares estimator. Moreover, the optimal value of the ridge penalty in this situation can be negative. This happens when the high-variance directions in the predictor space can predict the response variable, which is often the case in real-world high-dimensional data. In this regime, low-variance directions provide an implicit ridge regularization and can make any further positive ridge penalty detrimental. We prove that augmenting any linear model with random covariates and using the minimum-norm estimator is asymptotically equivalent to adding a ridge penalty. We use a spiked covariance model as an analytically tractable example and prove that the optimal ridge penalty in this case is negative when $n\ll p$.
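
To make the objects in the abstract concrete, the following minimal sketch computes a ridge estimator in its dual form, $X^\top(XX^\top+\lambda I)^{-1}\mathbf y$, which stays well defined for $n<p$ at $\lambda=0$ (where it coincides with the minimum-norm least squares estimator) and for negative $\lambda$ above minus the smallest eigenvalue of $XX^\top$. The function name `ridge_dual`, the sizes, the seed, and the Gaussian toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ridge_dual(X, y, lam):
    """Ridge estimator in dual form: X^T (X X^T + lam * I)^{-1} y.

    Defined whenever X X^T + lam * I is invertible, which for n < p
    includes lam = 0 and mildly negative lam.
    """
    n = X.shape[0]
    K = X @ X.T  # n x n Gram matrix
    return X.T @ np.linalg.solve(K + lam * np.eye(n), y)

# Hypothetical toy data in the underdetermined regime (n < p).
rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# lam = 0 gives the minimum-norm interpolating solution.
beta_min_norm = np.linalg.pinv(X) @ y
beta_zero = ridge_dual(X, y, 0.0)

# A negative penalty is admissible as long as it stays above
# minus the smallest eigenvalue of X X^T.
smallest_eig = np.linalg.eigvalsh(X @ X.T)[0]
lam_neg = -0.5 * smallest_eig
beta_neg = ridge_dual(X, y, lam_neg)
```

In this form a negative penalty inflates the coefficients along every singular direction instead of shrinking them, so `beta_neg` has a larger norm than the minimum-norm solution.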

Paper Structure

This paper contains 9 sections, 1 theorem, 27 equations, 6 figures.

Key Result

Theorem 1

Let $\boldsymbol{\hat{\beta}}_\lambda$ be a ridge estimator of ${\boldsymbol\beta}\in\mathbb R^p$ in a linear model $y= \mathbf x^\top {\boldsymbol\beta} + \varepsilon$, given some training data $(\mathbf X, \mathbf y)$ and some value of $\lambda$. We construct a new estimator $\boldsymbol{\hat{\beta}}_q$ […] In addition, for any given $\mathbf x$, let $\hat{y}_\lambda=\mathbf x^\top \boldsymbol{\hat{\beta}}_\lambda$ […]
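
The equivalence claimed in the abstract (augmenting with random covariates and fitting by minimum norm approaches ridge) can be checked numerically. The sketch below is a toy illustration under stated assumptions, not the paper's proof or code: the extra $q$ columns are assumed to be i.i.d. Gaussian with variance $\lambda/q$ (the scaling mentioned in the caption of Figure 4b), and the function names, sizes, seed, and value of $\lambda$ are hypothetical.

```python
import numpy as np

def ridge(X, y, lam):
    """Standard ridge estimator (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def augmented_min_norm(X, y, lam, q, rng):
    """Append q random columns with variance lam / q, fit the
    minimum-norm least squares solution, keep the first p coefficients."""
    n, p = X.shape
    Z = rng.normal(0.0, np.sqrt(lam / q), size=(n, q))  # assumed Gaussian
    X_aug = np.hstack([X, Z])
    # For an underdetermined system lstsq returns the minimum-norm solution.
    beta_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return beta_aug[:p]

# Hypothetical toy problem.
rng = np.random.default_rng(0)
n, p, lam = 20, 10, 5.0
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + rng.standard_normal(n)

beta_ridge = ridge(X, y, lam)

def rel_err(q):
    """Relative distance between the truncated augmented fit and ridge."""
    beta_q = augmented_min_norm(X, y, lam, q, rng)
    return np.linalg.norm(beta_q - beta_ridge) / np.linalg.norm(beta_ridge)

err_small = rel_err(30)      # few random columns
err_large = rel_err(20000)   # many random columns
```

The intuition is that the first $p$ coefficients of the minimum-norm solution equal $X^\top(XX^\top+ZZ^\top)^{-1}\mathbf y$, and $ZZ^\top$ concentrates around $\lambda I$ as $q$ grows, which recovers the dual form of ridge.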

Figures (6)

  • Figure 1: Cross-validation estimate of ridge regression performance for the liver.toxicity dataset. a. Using $p=50$ random predictors. b. Using all $p=3116$ predictors. Lines correspond to 10 dependent variables. Dots show minimum values.
  • Figure 2: a–d. Expected normalized MSE of ridge estimators using a model with correlated predictors. On all subplots $n=64$. Subplots correspond to the number of predictors $p$ taking values 50, 75, 150, and 1000. Dots mark the points with minimum risk. e. Expected normalized MSE of OLS (for $p<n$) and minimum-norm OLS (for $p>n$) estimators using the same model with $p\in[10,1000]$. Dots mark the dimensionalities corresponding to subplots (a–d). Dashed line: the expected normalized MSE of the optimal ridge estimator. f. The values of $\lambda$ minimizing the expected risk. For $p\gtrsim 600$, the optimal value of ridge penalty was negative: $\lambda_\mathrm{opt}<0$. g. Expected normalized MSE of ridge estimators for $p=1000$ including negative values of $\lambda$. The minimum was attained at $\lambda_\mathrm{opt}=-150$.
  • Figure 3: a. The optimal regularization parameter $\lambda_\mathrm{opt}$ as a function of sample size ($n$) and dimensionality ($p$) in the model with uncorrelated predictors ($\rho=0$). In this case $\lambda_\mathrm{opt}=p\sigma^2/\lVert\boldsymbol\beta\rVert^2=p/\alpha$. The black line corresponds to $n=p$. b. The optimal regularization parameter $\lambda_\mathrm{opt}$ in the model with correlated predictors ($\rho=0.1$).
  • Figure 4: a. Expected MSE as a function of ridge penalty in the toy model with $p=50$ weakly correlated predictors that are all weakly correlated with the response ($n=64$). This is the same plot as in Figure 2a. The dot denotes minimal risk and the square denotes the MSE of the OLS estimator ($\lambda=0$). The horizontal line shows the optimal risk corresponding to $\lambda_\mathrm{opt}$. b. Augmenting the model with up to $q=400$ random predictors with variance $\lambda_\mathrm{opt}/q$. Solid line corresponds to $\boldsymbol{\hat{\beta}}_q$ (i.e. $\boldsymbol{\hat{\beta}}_\mathrm{augm}$ truncated to $p$ predictors); dashed line corresponds to the full $\boldsymbol{\hat{\beta}}_\mathrm{augm}$. c. Augmenting the model with up to $q=400$ random predictors with variance equal to 1. d. The optimal ridge penalty $\lambda_\mathrm{opt}$ in the model augmented with random predictors with adaptive variance, as in panel (b). e. The optimal ridge penalty $\lambda_\mathrm{opt}$ in the model augmented with random predictors with variance 1, as in panel (c).
  • Figure 5: a. The derivative of the expected risk as a function of ridge penalty $\lambda$ at $\lambda=0$, in the model with $p$ weakly correlated predictors. Sample size $n=64$. b. Zoom-in on panel (a). The derivative becomes positive for $p\gtrsim 600$, implying that $\lambda_\mathrm{opt}<0$.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Theorem 1