Choosing the Right Regularizer for Applied ML: Simulation Benchmarks of Popular Scikit-learn Regularization Frameworks

Benjamin S. Knight, Ahsaas Bajaj

Abstract

This study surveys the historical development of regularization, tracing its evolution from stepwise regression in the 1960s to recent advances in formal error control, structured penalties for non-independent features, Bayesian methods, and $\ell_0$-based regularization, among other techniques. We empirically evaluate the performance of four canonical frameworks -- Ridge, Lasso, ElasticNet, and Post-Lasso OLS -- across 134,400 simulations spanning a 7-dimensional manifold grounded in eight production-grade machine learning models. Our findings demonstrate that, with respect to prediction accuracy, Ridge, Lasso, and ElasticNet are nearly interchangeable when the sample-to-feature ratio is sufficient (n/p >= 78). However, Lasso's recall is highly fragile under multicollinearity: at high condition numbers ($\kappa$) and low signal-to-noise ratios, Lasso recall collapses to 0.18 while ElasticNet maintains 0.93. Consequently, we advise practitioners against using Lasso or Post-Lasso OLS at high $\kappa$ with small sample sizes. The analysis concludes with an objective-driven decision guide to assist machine learning engineers in selecting the optimal scikit-learn-supported framework based on observable feature-space attributes.
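
As context for the comparison, the sketch below shows one way the four benchmarked frameworks can be instantiated in scikit-learn. The alpha grids, the toy data, and the Post-Lasso refit (Lasso for variable selection, then OLS on the surviving features) are illustrative assumptions, not the paper's exact simulation configuration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV, LinearRegression

# Toy data standing in for one simulation cell (n/p well above the
# n/p >= 78 threshold discussed in the abstract).
X, y = make_regression(n_samples=2000, n_features=25, n_informative=10,
                       noise=10.0, random_state=0)

# Three of the four canonical frameworks, tuned by cross-validation.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0).fit(X, y)

# Post-Lasso OLS (assumed form): use Lasso only for selection, then
# refit ordinary least squares on the selected features.
support = lasso.coef_ != 0
post_lasso = LinearRegression().fit(X[:, support], y)

for name, score in [("Ridge", ridge.score(X, y)),
                    ("Lasso", lasso.score(X, y)),
                    ("ElasticNet", enet.score(X, y)),
                    ("Post-Lasso OLS", post_lasso.score(X[:, support], y))]:
    print(f"{name:>15}: in-sample R^2 = {score:.3f}")
```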

Paper Structure

This paper contains 49 sections, 14 equations, 26 figures, 8 tables, and 4 algorithms.

Figures (26)

  • Figure 1: The three norms as applied to regularization. Regularization using the $\ell_0$ norm (solid line) performs discrete variable selection, keeping only the larger coefficient while setting the smaller one to zero. Lasso regularization (thick dashed line) via the $\ell_1$ norm nudges $\hat{\beta}$ towards the axes, thus tending towards partial sparsity with some shrinkage of both coefficients. Ridge regression (dotted circular line) uses the $\ell_2$ norm, shrinking both coefficients proportionally toward the origin while maintaining their relative magnitudes. (The corresponding penalized objectives are written out just after this list.)
  • Figure 2: Eigenvalue dispersion distributions. We draw eigenvalues from one of two distributions: low dispersion (Pareto $\alpha$=2.0, $\kappa\approx10$) versus high dispersion (Log-Normal $\mu$=-2.0, $\sigma$=2.5, $\kappa\approx10^{5}$ to $10^{6}$). (A sampling sketch for these two regimes appears after this list.)
  • Figure 3: $\beta$ values are drawn from either a Gamma or a Uniform distribution, for a total of five distribution/parameter combinations.
  • Figure 5: We selected a wide range of potential $\alpha$ values to accommodate both RidgeCV and LassoCV. Note how the optimal $\ell_1$ mixing parameter tends to fall at the extremes (i.e., 0.0 or 1.0).
  • Figure 6: The precision with which Lasso is able to recover the true, non-zero $\beta$ coefficients is a function of the signal-to-noise ratio within the feature set.
  • ...and 21 more figures
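
For reference, the penalties sketched in Figure 1 correspond to the following penalized least-squares objectives; this is the standard textbook formulation, not necessarily the exact notation used in the paper's own equations:

$$\hat{\beta}_{\ell_q} = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda\, \Omega_q(\beta), \qquad \Omega_0(\beta) = \sum_j \mathbf{1}\{\beta_j \neq 0\}, \quad \Omega_1(\beta) = \sum_j \lvert \beta_j \rvert, \quad \Omega_2(\beta) = \sum_j \beta_j^2 .$$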
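
Figure 2's two dispersion regimes can be reproduced with a sketch like the one below. Embedding the sampled spectrum in a random orthogonal basis to form a covariance matrix is an assumption about the simulation design, not the paper's documented procedure.

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(0)
p = 25  # number of features (illustrative)

# Low-dispersion spectrum: classical Pareto with shape alpha = 2.0
# (NumPy's pareto() draws from the Lomax form; adding 1 recovers the
# classical Pareto with minimum 1). Yields kappa on the order of 10.
eig_low = rng.pareto(a=2.0, size=p) + 1.0

# High-dispersion spectrum: Log-Normal with mu = -2.0, sigma = 2.5,
# yielding kappa on the order of 10^5 to 10^6.
eig_high = rng.lognormal(mean=-2.0, sigma=2.5, size=p)

# Embed each spectrum in a random orthogonal basis to obtain a
# feature covariance with exactly those eigenvalues.
Q = ortho_group.rvs(dim=p, random_state=0)
sigma_low = Q @ np.diag(eig_low) @ Q.T
sigma_high = Q @ np.diag(eig_high) @ Q.T

# Condition number of the covariance: ratio of extreme eigenvalues.
for name, eigs in [("low dispersion", eig_low), ("high dispersion", eig_high)]:
    kappa = eigs.max() / eigs.min()
    print(f"{name}: condition number kappa = {kappa:.1e}")
```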