Naive Feature Selection: a Nearly Tight Convex Relaxation for Sparse Naive Bayes

Armin Askari, Alexandre d'Aspremont, Laurent El Ghaoui

TL;DR

This work introduces a sparse variant of Naive Bayes that enforces a cardinality constraint on the classifier weights, enabling interpretable feature selection with near-linear overall complexity. For binary (Bernoulli) data, the sparse NB problem is solvable exactly, while for multinomial data it admits a computable convex upper bound and a primal recovery method, with tightness guaranteed by Shapley-Folkman-based duality analysis. The authors derive closed-form solutions for the sparse Bernoulli case and develop an efficiently computable 1D convex dual for the sparse multinomial case, both yielding $O(mn + m\log k)$ complexity, where $m$ is the number of features, $n$ the number of samples, and $k$ the sparsity level. They prove that the duality gap diminishes as the marginal contribution of additional features decreases, and validate the approach with extensive experiments on text and genetic data, showing competitive accuracy with state-of-the-art feature selection methods at a fraction of the computational cost. Overall, the paper provides a theoretically grounded, scalable framework for sparse Naive Bayes that enables effective feature selection in very large-scale datasets.

Abstract

Due to its linear complexity, naive Bayes classification remains an attractive supervised learning method, especially in very large-scale settings. We propose a sparse version of naive Bayes, which can be used for feature selection. This leads to a combinatorial maximum-likelihood problem, for which we provide an exact solution in the case of binary data, or a bound in the multinomial case. We prove that our convex relaxation bound becomes tight as the marginal contribution of additional features decreases, using a priori duality gap bounds derived from the Shapley-Folkman theorem. We show how to produce primal solutions satisfying these bounds. Both binary and multinomial sparse models are solvable in time almost linear in problem size, representing a very small extra relative cost compared to classical naive Bayes. Numerical experiments on text data show that the naive Bayes feature selection method is as statistically effective as state-of-the-art feature selection methods such as recursive feature elimination, $\ell_1$-penalized logistic regression and LASSO, while being orders of magnitude faster.

Paper Structure

This paper contains 35 sections, 9 theorems, 95 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Theorem 3.1

Consider the sparse Bernoulli naive Bayes training problem (\ref{eq:bnb0}), with binary data matrix $X \in \{0,1\}^{n \times m}$. The optimal values of the variables are obtained as follows. Set […]. Then identify a set $\mathcal{I}$ of indices with the $k$ largest elements in $w-v$, and set $\theta^{+}_\ast, \theta^{-}_\ast$ according to […].
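
The selection rule in Theorem 3.1 can be illustrated with a minimal NumPy sketch. The excerpt above omits the definitions of $w$ and $v$, so the code assumes the natural reading: $w_i$ is the maximum Bernoulli log-likelihood of feature $i$ when each class has its own parameter, and $v_i$ is the maximum log-likelihood when both classes share one pooled parameter. Under that assumption, $w_i - v_i \ge 0$ is the gain from letting feature $i$ discriminate between the classes. The function names (`xlogy0`, `bernoulli_loglik`, `sparse_bernoulli_nb`) are hypothetical, not the authors' code.

```python
import numpy as np

def xlogy0(a, b):
    """Elementwise a * log(b), using the convention 0 * log(0) = 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    out = np.zeros_like(a)
    mask = a > 0
    out[mask] = a[mask] * np.log(b[mask])
    return out

def bernoulli_loglik(f, n):
    """Maximum log-likelihood per feature: f ones out of n samples, at p = f / n."""
    f = np.asarray(f, dtype=float)
    p = f / n
    return xlogy0(f, p) + xlogy0(n - f, 1.0 - p)

def sparse_bernoulli_nb(X, y, k):
    """Pick the k features with the largest gain w - v (ASSUMED definitions).

    X: (n, m) binary matrix; y: length-n binary labels; 1 <= k <= m.
    Returns (selected indices, theta_pos, theta_neg).
    """
    X = np.asarray(X)
    y = np.asarray(y).astype(bool)
    m = X.shape[1]
    if not 1 <= k <= m:
        raise ValueError("k must satisfy 1 <= k <= m")

    n_pos, n_neg = int(y.sum()), int((~y).sum())
    f_pos = X[y].sum(axis=0)    # per-feature counts of ones in the positive class
    f_neg = X[~y].sum(axis=0)   # per-feature counts of ones in the negative class

    # Assumption: w = separate per-class fit, v = pooled fit shared by both classes.
    w = bernoulli_loglik(f_pos, n_pos) + bernoulli_loglik(f_neg, n_neg)
    v = bernoulli_loglik(f_pos + f_neg, n_pos + n_neg)
    gain = w - v

    # Indices of the k largest gains, without a full sort.
    selected = np.sort(np.argpartition(-gain, k - 1)[:k])

    # Unselected features share the pooled estimate; selected ones get class-wise estimates.
    theta_pos = (f_pos + f_neg) / (n_pos + n_neg)
    theta_neg = theta_pos.copy()
    theta_pos[selected] = f_pos[selected] / n_pos
    theta_neg[selected] = f_neg[selected] / n_neg
    return selected, theta_pos, theta_neg
```

Computing the counts takes one pass over the data matrix, and the top-$k$ selection is cheap relative to that pass, which is consistent with the near-linear complexity claimed in the abstract. The sketch uses `np.argpartition` rather than the specific selection procedure of the paper, and it omits smoothing, so estimated parameters can be exactly 0 or 1.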

Figures (4)

  • Figure 1: Experiment 1: Accuracy versus run time with the IMDB dataset/Count Vector with MNB in stage 2, showing performance on par with the best feature selection methods, at a fraction of the computing cost. Times do not include the cost of grid search to reach the target cardinality for $\ell_1$-based methods. For more details on the experiment, see Appendix \ref{appendixF}.
  • Figure 2: Experiment 2 (Left): Accuracy gain for our method (top panel) and slowdown factor (bottom panel) over all data sets listed in Table \ref{tab:dsinfo2} with MNB in stage 2, showing a substantial performance increase with a constant increase in computational cost. Experiment 3 (Right): Run time with the IMDB dataset/tf-idf vector data set, with increasing $m,k$ at fixed ratio $k/m$, empirically showing (sub)linear complexity.
  • Figure 3: Experiment 5: Tradeoff of objective value of \ref{eq:mnb0} vs sparsity level $k$.
  • Figure 4: Experiment 6: Duality gap bound versus sparsity level for $m = 30$ (top panel) and $m= 3000$ (bottom panel), showing that the duality gap quickly closes as $m$ or $k$ increase.

Theorems & Definitions (10)

  • Theorem 3.1: Sparse Bernoulli naive Bayes
  • Theorem 3.2: Sparse multinomial naive Bayes
  • Definition 3.4
  • Theorem 3.5
  • Theorem 3.6: Quality of Sparse Multinomial Naive Bayes Relaxation
  • Proposition 3.7
  • Proposition 3.8
  • Proposition 3.9
  • Proposition 3.10
  • Lemma 3.11