Bayesian Multinomial Logistic Regression for Numerous Categories

Jared D. Fisher, Kyle R. McEvoy

TL;DR

This paper adapts a gamma-augmentation strategy to decouple category-specific coefficient updates, so that each category's coefficients can be updated conditional on a single auxiliary variable per subject, rather than on the full set of other categories' coefficients.

Abstract

Bayesian multinomial logistic regression provides a principled, interpretable approach to multiclass classification, but posterior sampling becomes increasingly expensive as the model dimension grows. Prior work has studied scalability in the number of subjects and covariates; in contrast, this paper focuses on how computation changes as the number of outcome categories increases. To improve scalability in settings with numerous categories, we adapt a gamma-augmentation strategy to decouple category-specific coefficient updates, so that each category's coefficients can be updated conditional on a single auxiliary variable per subject, rather than on the full set of other categories' coefficients. Because the resulting coefficient conditionals are non-conjugate, we couple this augmentation with either adaptive Metropolis-Hastings or elliptical slice sampling. Through simulation and a real-data example, we compare effective sample size and effective sampling rate across several standard competitors. We find that the best-performing sampler depends on the dimension and imbalance regime, and that the proposed augmentation provides substantial speedups in scenarios with numerous categories.
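The decoupling described above can be illustrated with a small sketch. It is not the authors' implementation; it assumes one standard form of gamma augmentation, based on the identity 1/S = ∫ exp(-uS) du, with the last category as the reference (coefficients fixed at zero), an independent normal prior, and a plain random-walk Metropolis step standing in for the adaptive Metropolis-Hastings or elliptical slice updates named in the abstract. The function name `gamma_augmented_sweep` and the parameters `step` and `prior_sd` are hypothetical.

```python
import numpy as np

def gamma_augmented_sweep(X, y, beta, rng, step=0.05, prior_sd=10.0):
    """One sweep of an assumed gamma-augmented sampler (illustrative only).

    X: (N, P) design matrix; y: (N,) labels in 0..C-1;
    beta: (P, C-1) coefficients, with category C-1 as the zero reference.
    """
    beta = beta.copy()
    P = X.shape[1]
    K = beta.shape[1]

    # Step 1: one auxiliary variable per subject.
    # u_i | beta ~ Exponential(rate = 1 + sum_k exp(x_i' beta_k)),
    # i.e. a Gamma(1, rate) draw; the leading 1 is the reference category.
    rate = 1.0 + np.exp(X @ beta).sum(axis=1)
    u = rng.exponential(scale=1.0 / rate)

    # Step 2: given u, the conditional for beta_c involves no other category.
    for c in range(K):
        is_c = (y == c)

        def log_target(b):
            eta = X @ b
            log_lik = np.sum(is_c * eta - u * np.exp(eta))
            log_prior = -0.5 * np.sum(b ** 2) / prior_sd ** 2
            return log_lik + log_prior

        current = beta[:, c].copy()
        proposal = current + step * rng.standard_normal(P)
        # Symmetric proposal, so the acceptance ratio is the target ratio.
        if np.log(rng.uniform()) < log_target(proposal) - log_target(current):
            beta[:, c] = proposal
    return beta, u
```

Because the log target in Step 2 depends only on `u` and the coefficients of category `c`, the loop over categories could in principle be run in any order or in parallel, which is the property the abstract attributes to the augmentation.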

Paper Structure (10 sections, 9 equations, 2 figures, 1 table)


Figures (2)

  • Figure 1: Comparison of posterior sampling performance metrics: MCMC iterations per second, effective sample size (ESS), and effective sampling rate (ESR). The minimum ESS and ESR are the minimums over all 10*(C-1) parameters, for each simulation replicate; likewise the median ESS and ESR are the medians over all 10*(C-1) parameters. The plotted lines are the median values across 20 simulation replicates, and the shaded regions cover the 5th to 95th percentiles. For each simulation replicate, each method ran for 6000 MCMC samples with 3000 discarded as burn-in.
  • Figure 2: Comparison of posterior sampling performance metrics: MCMC iterations per second, effective sample size (ESS), and effective sampling rate (ESR). The minimum (or median) ESS and ESR are the minimum (or median) over all $(P+1)(C-1)=209$ parameters. The plotted lines are the median values across 20 simulation replicates, and the shaded regions cover the 5th to 95th percentiles. For each simulation replicate, each method ran for 6000 MCMC samples with 3000 discarded as burn-in.
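
The metrics named in the captions above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes ESR means ESS divided by wall-clock seconds, and it uses a simple autocorrelation-based ESS estimator truncated at the first non-positive lag, which may differ from the estimator the authors used. The names `ess_1d` and `summarize_chain` are hypothetical.

```python
import numpy as np

def ess_1d(x):
    """ESS of one chain as n / (1 + 2 * sum of leading positive autocorrelations)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    acov = np.correlate(xc, xc, mode="full")[n - 1:] / n
    if acov[0] == 0.0:
        return 0.0  # convention assumed here: a chain that never moves
    rho = acov / acov[0]
    nonpos = np.where(rho <= 0)[0]
    cutoff = nonpos[0] if nonpos.size else n
    tau = 1.0 + 2.0 * rho[1:cutoff].sum()
    return n / tau

def summarize_chain(draws, seconds, burn_in):
    """Min and median ESS and ESR over parameters.

    draws: (iterations, parameters) array of MCMC samples;
    seconds: wall-clock sampling time; burn_in: iterations to discard.
    """
    kept = draws[burn_in:]
    ess = np.array([ess_1d(kept[:, j]) for j in range(kept.shape[1])])
    return {
        "min_ess": ess.min(),
        "median_ess": np.median(ess),
        "min_esr": ess.min() / seconds,
        "median_esr": np.median(ess) / seconds,
    }
```

With the settings in the captions, `draws` would have 6000 rows and `burn_in` would be 3000, so each ESS is computed from 3000 retained samples per parameter.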