Sparse Classification: a scalable discrete optimization perspective

Dimitris Bertsimas, Jean Pauphilet, Bart Van Parys

TL;DR

This paper reframes sparse classification as a binary convex optimization problem and introduces an exact, scalable cutting-plane algorithm to solve it, extending the approach to sparse logistic regression and sparse SVM. By leveraging a dual formulation and an enhanced outer-approximation scheme, the method achieves provably optimal sparse classifiers in high dimensions (up to tens of thousands of features) and exhibits favorable support-recovery behavior compared to Lasso, including sparser models with similar predictive power on real data. The authors establish an information-theoretic sufficient condition for recovery, showing that, under certain data-generating assumptions, a sample size threshold $n_0 < C(2+\sigma^2) k \log(p-k)$ suffices for reliable support recovery. Together, the empirical results and theory demonstrate that exact sparse classification can be both computationally tractable in practice and theoretically well-founded, offering a competitive and interpretable alternative to $\ell_1$-regularized methods in high-dimensional settings.

Abstract

We formulate the sparse classification problem of $n$ samples with $p$ features as a binary convex optimization problem and propose a cutting-plane algorithm to solve it exactly. For sparse logistic regression and sparse SVM, our algorithm finds optimal solutions for $n$ and $p$ in the $10,000$s within minutes. On synthetic data our algorithm achieves perfect support recovery in the large sample regime. Namely, there exists an $n_0$ such that the algorithm takes a long time to find the optimal solution and does not recover the correct support for $n<n_0$, while for $n\geqslant n_0$, the algorithm quickly detects all the true features and does not return any false features. In contrast, while Lasso accurately detects all the true features, it persistently returns incorrect features, even as the number of observations increases. Consequently, on numerous real-world experiments, our outer-approximation algorithm returns sparser classifiers while achieving predictive accuracy similar to that of Lasso. To support our observations, we analyze conditions on the sample size needed to ensure full support recovery in classification. Under some assumptions on the data generating process, we prove that information-theoretic limitations impose $n_0 < C \left(2 + \sigma^2\right) k \log(p-k)$, for some constant $C>0$.
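
To get a feel for how the sample-size threshold scales, the sketch below evaluates the expression $C(2+\sigma^2) k \log(p-k)$. The function name `recovery_threshold`, the default $C = 1$, and the use of the natural logarithm are assumptions made for illustration; the abstract only asserts that some constant $C > 0$ exists.

```python
import math


def recovery_threshold(k: int, p: int, sigma: float, C: float = 1.0) -> float:
    """Evaluate C * (2 + sigma^2) * k * log(p - k).

    k     : number of truly relevant features (sparsity)
    p     : total number of features
    sigma : noise level of the data-generating process
    C     : unknown positive constant (hypothetical default of 1.0)
    """
    if not 0 < k < p:
        raise ValueError("need 0 < k < p so that log(p - k) is defined")
    if C <= 0:
        raise ValueError("C must be positive")
    return C * (2.0 + sigma ** 2) * k * math.log(p - k)


if __name__ == "__main__":
    # C is unknown, so only ratios between settings are meaningful.
    base = recovery_threshold(k=10, p=1_000, sigma=1.0)
    wider = recovery_threshold(k=10, p=10_000, sigma=1.0)
    noisier = recovery_threshold(k=10, p=1_000, sigma=2.0)
    print(f"10x more features multiplies the bound by {wider / base:.2f}")
    print(f"doubling sigma multiplies the bound by {noisier / base:.2f}")
```

Because $C$ is not specified, the absolute values are not sample-size recommendations; the ratios show that the bound grows linearly in $k$ and in $2+\sigma^2$ but only logarithmically in $p$.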

Paper Structure

This paper contains 30 sections, 8 theorems, 59 equations, 9 figures, 5 tables.

Key Result

Theorem 1

Under Assumption conv_loss, strong duality holds for problem eqn:reg_class; its dual is stated in terms of $\hat{\ell}(y, \alpha) := \max_{u \in \mathbb{R}} u \alpha - \ell(y,u)$, the Fenchel conjugate of the loss function $\ell$ [boyd2004convex].
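
The display equation of the theorem is missing from this page, so the following is a hedged reconstruction rather than a quotation. Assume that problem eqn:reg_class is the ridge-regularized empirical risk minimization over a weight vector $w$ and intercept $b$, with regularization parameter $\gamma > 0$, data matrix $X \in \mathbb{R}^{n \times p}$ whose rows are $x_i^\top$, and $e$ the all-ones vector (all of this notation is assumed here). A standard Lagrangian derivation then gives the primal-dual pair

$$
\min_{w \in \mathbb{R}^p,\, b \in \mathbb{R}} \; \sum_{i=1}^n \ell\left(y_i, w^\top x_i + b\right) + \frac{1}{2\gamma} \|w\|_2^2
\;=\;
\max_{\alpha \in \mathbb{R}^n :\, e^\top \alpha = 0} \; -\sum_{i=1}^n \hat{\ell}(y_i, \alpha_i) - \frac{\gamma}{2}\, \alpha^\top X X^\top \alpha .
$$

As a worked instance of the conjugate, for the hinge loss $\ell(y,u) = \max(0, 1 - yu)$ with $y \in \{-1, 1\}$, one obtains $\hat{\ell}(y, \alpha) = y\alpha$ if $y\alpha \in [-1, 0]$ and $+\infty$ otherwise, so the dual has a linear term in $\alpha$ plus box constraints on $y_i \alpha_i$.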

Figures (9)

  • Figure 1: Summary of known necessary [scarlett2017limits] and sufficient (see Theorem \ref{sufficient}) conditions on the sample size $n$ to achieve perfect support recovery in classification, when the sparsity $k$ scales linearly in the dimension $p$ and the signal-to-noise ratio is high. Thresholds are given up to a multiplicative constant.
  • Figure 2: Evolution of the accuracy (number of true features selected) as sample size $n$ increases, for ElasticNet with the logistic loss (dashed blue) and sparse SVM (solid red).
  • Figure 3: Evolution of the AUC (left) and misclassification rate (right) on a validation set as sample size $n$ increases, for ElasticNet with logistic loss (dashed blue) and sparse SVM (solid red).
  • Figure 4: Evolution of the number of cuts (left panel) and computational time (right panel) required by the outer-approximation algorithm with hinge loss as sample size $n$ increases.
  • Figure 5: Evolution of the upper (best feasible solution, in green) and lower (in blue) bounds in Algorithm \ref{OA} as computational time (in log scale) increases.
  • ...and 4 more figures

Theorems & Definitions (9)

  • Theorem 1
  • Remark 1
  • Theorem 2
  • Theorem 3
  • Lemma 1
  • Theorem 4
  • Lemma 2
  • Lemma 3
  • Lemma 4