Ridge Regularization: an Essential Concept in Data Science

Trevor Hastie

TL;DR

This paper collects some of the magic and beauty of ridge that my colleagues and I have encountered over the past 40 years in applied statistics.

Abstract

Ridge, or more formally $\ell_2$ regularization, shows up in many areas of statistics and machine learning. It is one of those essential devices that any good data scientist needs to master for their craft. In this brief ridge fest I have collected some of the magic and beauty of ridge that my colleagues and I have encountered over the past 40 years in applied statistics.

Paper Structure

This paper contains 10 sections, 43 equations, and 5 figures.

Figures (5)

  • Figure 1: Constraint balls for ridge, lasso and elastic-net regularization. The sharp edges and corners of the latter two allow for variable selection as well as shrinkage.
  • Figure 2: Simulation from a linear model with $n=100$, $p=54$ and $SNR=3.3$. [Left panel] Coefficient profiles $\hat{\beta}_\lambda$ versus $\log\lambda$. The OLS coefficients are on the far left. The left vertical broken line is at the optimal EPE, and the red bars are the true coefficients. The second vertical line corresponds to the minimum LOO CV error. The dashes on the right axis are the James-Stein (uniformly shrunk) estimates. [Right panel] The EPE of the fitted model on an infinite test data set (orange): $EPE(\lambda)=\sigma^2+E_X(\hat{f}_\lambda(X)-f(X))^2$, and the LOO CV curve estimated from the 100 training points.
  • Figure 3: Examples of data augmentation. [Left] A mass of fake points at the origin fattens the data cloud, and stabilizes the coefficient estimates in the direction of the smaller principal axis. [Right] Many perturbed versions of the original data points are presented, all with the same response. If the perturbations have zero mean and scalar covariance with the right scalar, the result is approximate ridge.
  • Figure 4: Double descent in the generalization error of minimum-norm estimation as the dimension increases.
  • Figure 5: Singular values for the natural spline bases on the training data, as the dimension increases from 100 (orange) to 270 (purple).
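
The data-augmentation idea in the Figure 3 caption can be checked numerically. The sketch below is a minimal illustration, not the paper's own code: it assumes the penalized objective $\|y - X\beta\|^2 + \lambda\|\beta\|^2$ with no intercept, and uses the standard identity that appending $p$ fake rows $\sqrt{\lambda}\,I_p$ with zero response to the data and running ordinary least squares reproduces the ridge solution. The simulated data, the seed, and the value `lam = 2.0` are arbitrary assumptions chosen for illustration; only $n=100$ and $p=54$ echo the Figure 2 setup.

```python
import numpy as np

# Simulated data (assumed for illustration; sizes echo Figure 2).
rng = np.random.default_rng(0)
n, p, lam = 100, 54, 2.0
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + rng.standard_normal(n)

# Ridge in closed form: solve (X'X + lam * I) beta = X'y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Data augmentation: append p fake rows sqrt(lam) * I with zero response,
# then fit ordinary least squares on the augmented data.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# The two coefficient vectors should agree up to floating-point error.
max_abs_diff = float(np.max(np.abs(beta_ridge - beta_aug)))
print(f"max |ridge - augmented OLS| = {max_abs_diff:.2e}")
```

The agreement follows because the augmented residual sum of squares equals $\|y - X\beta\|^2 + \lambda\|\beta\|^2$ term by term, so both fits minimize the same criterion.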