Parsimonious Subset Selection for Generalized Linear Models with Biomedical Applications

Anant Mathur, Benoit Liquet, Samuel Muller, Sarat Moka

Abstract

High-dimensional biomedical studies require models that are simultaneously accurate, sparse, and interpretable, yet exact best subset selection for generalized linear models is computationally intractable. We develop a scalable method that combines a continuous Boolean relaxation of the subset problem with a Frank–Wolfe algorithm driven by envelope gradients. The resulting method, which we refer to as COMBSS-GLM, is simple to implement, requires one penalized generalized linear model fit per iteration, and produces sparse models along a model-size path. Theoretically, we identify a curvature-based parameter regime in which the relaxed objective is concave in the selection weights, implying that global minimizers occur at binary corners. Empirically, in logistic and multinomial simulations across low- and high-dimensional correlated settings, the proposed method consistently improves variable-selection quality relative to established penalized likelihood competitors while maintaining strong predictive performance. In biomedical applications, it recovers established loci in a binary-outcome rice genome-wide association study and achieves perfect multiclass test accuracy on the Khan SRBCT cancer dataset using a small subset of genes. Open-source implementations are available in R at https://github.com/benoit-liquet/COMBSS-GLM-R and in Python at https://github.com/saratmoka/COMBSS-GLM-Python.
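To make the Frank–Wolfe idea concrete, the following is a minimal, generic sketch and not the authors' implementation. It assumes the relaxed weights $\boldsymbol{t}$ are constrained to the polytope $\{\boldsymbol{t}\in[0,1]^p : \sum_j t_j = k\}$ for a target model size $k$, and it replaces the paper's objective with a toy concave function. The names `lmo_cardinality`, `frank_wolfe`, and `toy_grad` are hypothetical. In the actual method, the gradient would instead come from the penalized generalized linear model fit performed at each iteration.

```python
import numpy as np

def lmo_cardinality(grad, k):
    """Linear minimization oracle over {s in [0,1]^p : sum(s) = k}.

    A linear function <grad, s> is minimized over this polytope at a
    binary vertex with ones on the k smallest gradient coordinates.
    """
    s = np.zeros_like(grad, dtype=float)
    s[np.argsort(grad, kind="stable")[:k]] = 1.0
    return s

def frank_wolfe(grad_fn, p, k, n_iter=50):
    """Generic Frank-Wolfe loop with the classical 2/(i+2) step size."""
    t = np.full(p, k / p)  # feasible interior start: sum(t) == k
    for i in range(n_iter):
        s = lmo_cardinality(grad_fn(t), k)   # best vertex for the linearization
        gamma = 2.0 / (i + 2.0)
        t = (1.0 - gamma) * t + gamma * s    # convex combination stays feasible
    return t

# Toy stand-in objective (an assumption, NOT the paper's f_{delta,lambda}):
# f(t) = <c, t> - 0.5 * ||t||^2, which is concave in t, so its minimizers
# over the polytope lie at binary corners.
c = np.array([0.9, 0.1, 0.5, 0.2, 0.8])
toy_grad = lambda t: c - t

t_hat = frank_wolfe(toy_grad, p=5, k=2)
selected = np.sort(np.argsort(-t_hat)[:2])  # indices of the 2 largest weights
```

Because every Frank–Wolfe iterate is a convex combination of feasible points, no projection step is needed; the only per-iteration work beyond the gradient is a partial sort.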

Paper Structure (34 sections, 2 theorems, 51 equations, 9 figures, 4 tables, 1 algorithm)

Key Result

Proposition 1

For every fixed $\boldsymbol{t}\in(0,1)^{p-m}$, the map $\delta\mapsto f_{\delta,\lambda}(\boldsymbol{t})$ is monotone non-decreasing and continuous in $\delta>0$.

Figures (9)

  • Figure 1: Boolean relaxation surface $f_{\delta,\lambda}(\boldsymbol{t})$ for logistic regression with $\lambda = 0$ and varying curvature parameter $\delta$. The surfaces are plotted over the domain $\boldsymbol{t} = (t_1, t_2) \in [0,1]^2$ with five geometrically increasing values of $\delta$ shown in different colours according to the colour bar. As $\delta$ increases, the relaxed objective becomes more peaked, driving the solution toward binary corners.
  • Figure 2: Performance results in the low-dimensional setting ($n = 200$, $p = 30$) for Case 1 (top row) and Case 2 (bottom row). Each panel displays the average over 50 replications as a function of predictor correlation $\rho \in \{0, 0.2, 0.4, 0.6\}$, with vertical bars denoting one standard error.
  • Figure 3: Performance results in the high-dimensional setting ($n = 200$, $p = 1000$) for Case 1 (top row) and Case 2 (bottom row). Each panel displays the average over 50 replications as a function of predictor correlation $\rho \in \{0, 0.2, 0.4, 0.6\}$, with vertical bars denoting one standard error.
  • Figure 4: (a) Best-subset inclusion path showing the selected SNPs for different model sizes $k = 1$ to $k = 10$. (b) Correlation matrix for the selected SNPs showcasing relationships between the predictors.
  • Figure 5: Best-subset inclusion path for the Khan SRBCT gene expression dataset. Each row corresponds to a model size $k = 1, \ldots, 20$, and each filled cell indicates that the corresponding gene is included in the best subset of that size. The test classification accuracy is displayed at the end of each row, with perfect accuracy ($100\%$) highlighted in red. The path exhibits a monotone nesting property: once a gene enters the model, it remains selected at all larger model sizes.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Proposition 1: Monotonicity and continuity in $\delta$
  • Theorem 1: Concavity threshold in $\boldsymbol{t}$
  • Remark 1
  • Remark 2: Computational complexity
  • Proof of Proposition 1
  • Proof of Theorem 1