GEMSS: A Variational Bayesian Method for Discovering Multiple Sparse Solutions in Classification and Regression Problems

Kateřina Henclová, Václav Šmídl

TL;DR

GEMSS is a variational Bayesian framework for uncovering multiple, diverse sparse feature subsets in underdetermined and highly correlated settings, where reporting a single solution hides predictive multiplicity. It combines a structured spike-and-slab prior with a mixture-of-Gaussians approximation of the multimodal posterior and a Jaccard-based diversity penalty that encourages distinct supports, all optimized within a single objective using Adam. Synthetic benchmarks (128 experiments across 7 tiers) show that GEMSS scales to $p$ up to 5000 with $n$ as low as 50, generalizes to regression, handles missing data natively, and remains robust to class imbalance and Gaussian noise. Key findings include strong recall and precision in recovering the generating features, a hierarchy in how data-quality issues degrade performance (missing data being the most detrimental), and practical guidelines for solution extraction and regularization. The method is intended to yield mechanistic insight alongside prediction, and is released with open-source implementations and a no-code explorer to ease adoption in science and industry.

Abstract

Selecting interpretable feature sets in underdetermined ($n \ll p$) and highly correlated regimes is a fundamental challenge in data science, particularly when analyzing physical measurements. In such settings, multiple distinct sparse subsets may explain the response equally well. Identifying these alternatives is crucial for generating domain-specific insights into the underlying mechanisms, yet conventional methods typically isolate a single solution, obscuring the full spectrum of plausible explanations. We present GEMSS (Gaussian Ensemble for Multiple Sparse Solutions), a variational Bayesian framework designed to discover multiple, diverse sparse feature combinations simultaneously. The method employs a structured spike-and-slab prior for sparsity, a mixture of Gaussians to approximate the intractable multimodal posterior, and a Jaccard-based penalty to further control solution diversity. Unlike sequential greedy approaches, GEMSS optimizes the entire ensemble of solutions within a single objective function via stochastic gradient descent. The method is validated on a benchmark comprising 128 synthetic experiments across classification and regression tasks. Results demonstrate that GEMSS scales effectively to high-dimensional settings ($p=5000$) with sample sizes as small as $n = 50$, generalizes to continuous targets, handles missing data natively, and is robust to class imbalance and Gaussian noise. GEMSS is available as the Python package 'gemss' on PyPI. The full GitHub repository at https://github.com/kat-er-ina/gemss/ also includes a free, easy-to-use application suitable for non-coders.
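To make the notion of "solution diversity" concrete, the following minimal sketch computes the Jaccard similarity between the supports (sets of selected features) of several candidate sparse solutions. It is an illustration only, not the GEMSS implementation or the 'gemss' package API: the function names, the hard threshold used to define a support, and the toy coefficient vectors are all assumptions made for this example. The penalty used inside the GEMSS objective is optimized by gradient descent and may be defined differently.

```python
from itertools import combinations


def support(coefs, threshold=1e-3):
    """Indices of features whose absolute coefficient exceeds the threshold.

    The hard threshold is an assumption for illustration only.
    """
    return {i for i, c in enumerate(coefs) if abs(c) > threshold}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(solutions, threshold=1e-3):
    """Average Jaccard similarity over all pairs of solution supports.

    Lower values mean the solutions rely on more distinct feature subsets.
    """
    supports = [support(s, threshold) for s in solutions]
    pairs = list(combinations(supports, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical coefficient vectors for three candidate solutions (p = 6).
solutions = [
    [1.2, 0.0, -0.8, 0.0, 0.0, 0.0],  # support {0, 2}
    [0.0, 0.9, -0.7, 0.0, 0.0, 0.0],  # support {1, 2}
    [0.0, 0.0, 0.0, 1.1, 0.5, 0.0],   # support {3, 4}
]

overlap = mean_pairwise_jaccard(solutions)
print(f"mean pairwise Jaccard similarity: {overlap:.3f}")
```

In this toy case only the first two solutions share a feature, so the average overlap is low. A diversity penalty of this kind would add a cost that grows with such overlap, pushing the ensemble toward distinct supports.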

Paper Structure (66 sections, 10 equations, 23 figures, 25 tables, 1 algorithm)

Figures (23)

  • Figure 1: Summary of F1 scores in basic scenarios (Cases 14 and 16).
  • Figure 2: The effect of noise and missing data on F1 score, averaged over the two problem sizes (Case 23).
  • Figure 3: Performance on datasets with unbalanced class prevalence (Case 31).
  • Figure 4: Overall effect of the Jaccard penalty in two sparsity regimes (Case 28).
  • Figure 5: Performance overview for baseline and high-dimensional regression problems (Case 33).
  • ...and 18 more figures