Comparing Variable Selection and Model Averaging Methods for Logistic Regression

Nikola Sekulovski, František Bartoš, Don van den Bergh, Giuseppe Arena, Henrik R. Godmann, Vipasha Goyal, Julius M. Pfadt, Maarten Marsman, Adrian E. Raftery

Abstract

Model uncertainty is a central challenge in statistical models for binary outcomes such as logistic regression, arising when it is unclear which predictors should be included in the model. Many methods have been proposed to address this issue for logistic regression, but their relative performance under realistic conditions remains poorly understood. We therefore conducted a preregistered, simulation-based comparison of 28 established methods for variable selection and inference under model uncertainty, using 11 empirical datasets spanning a range of sample sizes and numbers of predictors, in cases both with and without separation. We found that Bayesian model averaging (BMA) methods based on g-priors, particularly g = max(n, p^2), show the strongest overall performance when separation is absent. When separation occurs, penalized likelihood approaches, especially the LASSO, provide the most stable results, while BMA with the local empirical Bayes (EB-local) prior is competitive in both situations. These findings offer practical guidance for applied researchers on how to effectively address model uncertainty in logistic regression in modern empirical and machine learning research.

Paper Structure

This paper contains 12 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Scores, performance, and additional metrics for 28 logistic regression methods with model uncertainty, averaged across 11 simulations on datasets without separation. Methods are ranked by the Partial Score, computed from RMSE (point estimation) and Brier score (prediction). All scores and performance metrics are standardized relative to the Spike–and–Slab method. Lower scores indicate better performance (blue), higher scores worse (orange/red). Additional metrics show average CPU time (minutes) and the proportion of failed models.
  • Figure 2: Scores, performance, and additional metrics for 28 logistic regression methods with model uncertainty, averaged across 11 simulations on datasets with separation. Methods are ranked by the Partial Score, computed from RMSE (point estimation) and Brier score (prediction). All scores are standardized relative to the Spike–and–Slab method. Lower scores indicate better performance (blue), higher scores worse (orange/red). Additional metrics show average CPU time (minutes) and the proportion of failed models.
  • Figure 3: Relationship between the number of predictors (p) and the number of observations (n) across the empirical datasets used in the simulations. Both axes are plotted on a log10 scale to aid interpretability, but tick marks reflect the original (untransformed) values. The dashed line indicates the identity line (n = p).
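
The captions for Figures 1 and 2 refer to RMSE (point estimation) and the Brier score (prediction), each standardized relative to a reference method. The following is a minimal Python sketch of the two standard metrics and of one simple way to express a value relative to a reference. The function names are hypothetical, and the ratio used for standardization is an assumption for illustration only; the paper's exact Partial Score formula is not given in this excerpt.

```python
import math


def rmse(estimates, truths):
    """Root mean squared error between estimated and true coefficients."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


def relative_to_reference(value, reference_value):
    """Assumed standardization: ratio of a method's metric to the reference method's.

    Values below 1 mean the method did better than the reference on a
    lower-is-better metric. This ratio is an illustrative assumption,
    not the paper's stated definition.
    """
    if reference_value == 0:
        raise ValueError("reference value must be non-zero")
    return value / reference_value
```

Both metrics are lower-is-better, which matches the captions' convention that lower scores indicate better performance.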