Optimal Cross-Validation for Sparse Linear Regression

Ryan Cory-Wright, Andrés Gómez

TL;DR

This work tackles the computational burden of hyperparameter tuning in ridge-regularized sparse linear regression with an $\ell_0$ constraint by developing convex, perspective-relaxation-based bounds on the $k$-fold cross-validation error. These relaxations yield tractable upper and lower bounds that remove the need to solve a mixed-integer optimization problem (MIO) for every fold and hyperparameter value, enabling a branch-and-bound scheme and a cyclic coordinate descent procedure that efficiently optimize the sparsity and ridge hyperparameters $(\tau,\gamma)$. Empirically, the approach reduces the number of MIOs solved by 50–80% and achieves 10–30% lower cross-validation error than grid search with MCP or GLMNet across real datasets, with SP-dominated CV performance in overdetermined regimes but some caveats in underdetermined settings. The framework thus offers a practical path to high-quality sparse models with improved generalization at significantly lower computational cost, and it extends naturally to hold-out validation.

Abstract

Given a high-dimensional covariate matrix and a response vector, ridge-regularized sparse linear regression selects a subset of features that explains the relationship between covariates and the response in an interpretable manner. To select the sparsity and robustness of linear regressors, techniques like $k$-fold cross-validation are commonly used for hyperparameter tuning. However, cross-validation substantially increases the computational cost of sparse regression, as it requires solving many mixed-integer optimization problems (MIOs) for each hyperparameter combination. To improve upon this state of affairs, we obtain computationally tractable relaxations of $k$-fold cross-validation metrics, facilitating hyperparameter selection after solving 50–80% fewer MIOs in practice. These relaxations result in an efficient cyclic coordinate descent scheme that achieves 10–30% lower validation errors than traditional methods such as grid search with MCP or GLMNet across a suite of 13 real-world datasets.
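To make the cost of the grid-search baseline concrete, the following is a minimal Python sketch of plain $k$-fold cross-validation over a grid of $(\tau,\gamma)$. It is not the paper's method: the objective $\|y - X\beta\|_2^2 + \tfrac{\gamma}{2}\|\beta\|_2^2$ subject to $\|\beta\|_0 \le \tau$ is an assumed form of the ridge-regularized problem, exhaustive subset enumeration stands in for an MIO solver (so it is only viable for very small $p$), and the function names `fit_sparse_ridge`, `kfold_cv_error`, and `grid_search` are hypothetical.

```python
import itertools
import numpy as np


def fit_sparse_ridge(X, y, tau, gamma):
    """Brute-force stand-in for one MIO solve (assumed objective):
    min ||y - X b||^2 + (gamma / 2) ||b||^2  s.t.  ||b||_0 <= tau, with tau >= 1.
    A superset of a support is never worse, so only supports of size
    min(tau, p) need to be enumerated."""
    p = X.shape[1]
    size = min(tau, p)
    best_obj, best_beta = np.inf, np.zeros(p)
    for support in itertools.combinations(range(p), size):
        S = list(support)
        Xs = X[:, S]
        # Stationarity condition: (Xs'Xs + (gamma/2) I) b = Xs'y
        A = Xs.T @ Xs + 0.5 * gamma * np.eye(size)
        b = np.linalg.solve(A, Xs.T @ y)
        r = y - Xs @ b
        obj = r @ r + 0.5 * gamma * (b @ b)
        if obj < best_obj:
            best_obj = obj
            best_beta = np.zeros(p)
            best_beta[S] = b
    return best_beta


def kfold_cv_error(X, y, tau, gamma, k):
    """Mean squared validation error over k contiguous folds."""
    n = X.shape[0]
    sq_err = 0.0
    for val_idx in np.array_split(np.arange(n), k):
        train = np.ones(n, dtype=bool)
        train[val_idx] = False
        beta = fit_sparse_ridge(X[train], y[train], tau, gamma)
        sq_err += np.sum((y[val_idx] - X[val_idx] @ beta) ** 2)
    return sq_err / n


def grid_search(X, y, taus, gammas, k):
    """Evaluate every (tau, gamma) pair; one sparse fit per fold and pair."""
    best = (np.inf, None, None)
    n_solves = 0
    for tau in taus:
        for gamma in gammas:
            err = kfold_cv_error(X, y, tau, gamma, k)
            n_solves += k
            if err < best[0]:
                best = (err, tau, gamma)
    return best, n_solves


# Small synthetic instance (illustrative values only).
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(40)

(best_err, best_tau, best_gamma), n_solves = grid_search(
    X, y, taus=range(1, 5), gammas=[0.1, 1.0, 10.0], k=5
)
print(best_tau, best_gamma, n_solves)
```

The number of sparse fits grows as (number of $\tau$ values) × (number of $\gamma$ values) × $k$. This is the count that the relaxation-based bounds described above aim to reduce, since each fit is an MIO in the real setting rather than a cheap enumeration.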

Paper Structure (23 sections, 7 theorems, 37 equations, 3 figures, 3 tables, 2 algorithms)

Key Result

Theorem 1

Given any $\gamma > 0$ and any bound the inequality holds, where $\bm{\beta}_{MIO}^*$ is an optimal solution of \eqref{eq:MIPII} and $\bm{\beta}_{persp}^*$ is optimal to \eqref{eq:persp}.

Figures (3)

  • Figure 1: Comparison of initial bounds on LOOCV ($k$-fold with $k=n$) from Algorithm \ref{alg:bounds} (left) and bounds after running Algorithm \ref{alg:parametricK} (right) for a synthetic sparse regression instance where $p=20, n=200, \tau_{\text{true}}=10$, for varying $\tau$. The black number in the top middle depicts the iteration number of the method.
  • Figure 2: Reduction in the number of MIOs solved (left) and the total number of branch-and-bound nodes (right) when using Algorithm \ref{alg:parametricK} for leave-one-out cross-validation, compared with Grid (i.e., independently solving $\mathcal{O}(pn)$ MIOs) on four real datasets. The distributions shown in the figure correspond to solving the same instance with different values of $\gamma$. All MIOs are solved to optimality, without imposing any time limits.
  • Figure 3: Reduction in the number of MIOs solved (left) and the total number of branch-and-bound nodes (right) when using Algorithm \ref{alg:parametricK} for 10-fold cross-validation, compared with Grid (i.e., independently solving $\mathcal{O}(pk)$ MIOs) on four real datasets. The distributions shown in the figure correspond to solving the same instance with different values of $\gamma$. All MIOs are solved to optimality, without imposing any time limits.

Theorems & Definitions (12)

  • Theorem 1
  • Remark 1: Computability of the bounds
  • Theorem 2
  • Corollary 1
  • Corollary 2
  • Remark 2: Relaxation Tightness
  • Remark 3: Intuition
  • Proposition 1
  • Proposition 2
  • Corollary 3
  • ...and 2 more