Prevalidated ridge regression is a highly-efficient drop-in replacement for logistic regression for high-dimensional data
Angus Dempster, Geoffrey I. Webb, Daniel F. Schmidt
TL;DR
Logistic regression (LR) requires careful hyperparameter tuning and can be computationally expensive in high dimensions, while ridge regression is fast but does not produce probabilistic outputs. The authors propose PreVal, a ridge-based classifier whose coefficients are scaled by a parameter κ chosen to minimise log-loss on prevalidated leave-one-out cross-validation (LOOCV) predictions, yielding probabilistic outputs with minimal tuning overhead. Using SVD preprocessing and a joint optimisation over κ and the regularisation parameter λ, PreVal matches the predictive performance of regularised LR (in both 0–1 loss and log-loss) across 273 high-dimensional datasets while achieving substantial computational speedups (up to 1000× in some settings). This makes PreVal a practical drop-in replacement for LR when the number of features p is large, offering efficient probabilistic predictions without extensive cross-validation or tuning.
Abstract
Logistic regression is a ubiquitous method for probabilistic classification. However, the effectiveness of logistic regression depends upon careful and relatively computationally expensive tuning, especially for the regularisation hyperparameter, and especially in the context of high-dimensional data. We present a prevalidated ridge regression model that closely matches logistic regression in terms of classification error and log-loss, particularly for high-dimensional data, while being significantly more computationally efficient and having effectively no hyperparameters beyond regularisation. We scale the coefficients of the model so as to minimise log-loss for a set of prevalidated predictions derived from the estimated leave-one-out cross-validation error. This exploits quantities already computed in the course of fitting the ridge regression model in order to find the scaling parameter with nominal additional computational expense.
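The idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (fit_preval_sketch, predict_proba), the ±1 one-vs-rest target coding, the grids over λ and κ, and the softmax link are all assumptions made for illustration. The sketch relies on the standard ridge identity that the leave-one-out prediction for observation i equals y_i − e_i / (1 − h_ii), where e_i is the in-sample residual and h_ii the i-th diagonal entry of the hat matrix, so the prevalidated predictions come almost for free once the SVD of the centred data is available.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def log_loss(probs, labels):
    # Mean negative log-probability assigned to the true class.
    n = labels.shape[0]
    return -np.mean(np.log(np.clip(probs[np.arange(n), labels], 1e-12, None)))

def fit_preval_sketch(X, labels, lambdas=None, kappas=None):
    # Hypothetical helper: the grids below are illustrative assumptions.
    lambdas = np.logspace(-3, 3, 13) if lambdas is None else lambdas
    kappas = np.logspace(-2, 2, 41) if kappas is None else kappas
    n = X.shape[0]
    n_classes = int(labels.max()) + 1

    # Assumed target coding: +1 for the true class, -1 otherwise.
    Y = -np.ones((n, n_classes))
    Y[np.arange(n), labels] = 1.0

    # Centre X and Y so that the intercept is not penalised.
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    UtY = U.T @ (Y - y_mean)

    best = None
    for lam in lambdas:
        d = s**2 / (s**2 + lam)                  # ridge shrinkage factors
        fitted = y_mean + U @ (d[:, None] * UtY)
        h = 1.0 / n + (U**2) @ d                 # hat-matrix diagonal (with intercept)
        # Prevalidated (leave-one-out) predictions via the shortcut formula.
        loo = Y - (Y - fitted) / (1.0 - h)[:, None]
        for kappa in kappas:
            loss = log_loss(softmax(kappa * loo), labels)
            if best is None or loss < best[0]:
                best = (loss, lam, kappa)

    loss, lam, kappa = best
    W = Vt.T @ ((s / (s**2 + lam))[:, None] * UtY)   # ridge coefficients
    b = y_mean - x_mean @ W
    # Scale coefficients and intercept by kappa to obtain the logits.
    return {"W": kappa * W, "b": kappa * b, "lam": lam,
            "kappa": kappa, "loo_log_loss": loss}

def predict_proba(model, X):
    return softmax(X @ model["W"] + model["b"])
```

Because the SVD is computed once and reused for every candidate λ, each additional λ costs only a few matrix products, and each candidate κ costs a single softmax over the prevalidated predictions. The grid search over κ here stands in for whatever optimiser the authors actually use.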
