Fast Model Selection and Stable Optimization for Softmax-Gated Multinomial-Logistic Mixture of Experts Models
TrungKhang Tran, TrungTin Nguyen, Md Abul Bashar, Nhat Ho, Richi Nayak, Christopher Drovandi
TL;DR
This work addresses stable maximum-likelihood training and principled model selection for softmax-gated multinomial-logistic MoE classifiers (SGMLMoE) in the full-data regime. It develops a batch MM algorithm with an explicit quadratic surrogate that provides coordinate-wise closed-form updates and guaranteed monotone improvement to a stationary point, avoiding inner EM-like steps. The authors introduce a dendrogram-based, sweep-free model selection pipeline grounded in a Voronoi loss over mixing measures, and establish finite-sample rates for density estimation and parameter recovery, adapting mixing-measure aggregation to the multiclass SGMLMoE setting. Empirical results on synthetic data and a protein–protein interaction task show improved accuracy and probability calibration with a compact, interpretable expert structure, highlighting the method's stability and practical utility for heterogeneous classification tasks.
Abstract
Mixture-of-Experts (MoE) architectures combine specialized predictors through a learned gate and are effective across regression and classification, but for classification with softmax multinomial-logistic gating, rigorous guarantees for stable maximum-likelihood training and principled model selection remain limited. We address both issues in the full-data (batch) regime. First, we derive a batch minorization-maximization (MM) algorithm for softmax-gated multinomial-logistic MoE using an explicit quadratic minorizer, yielding coordinate-wise closed-form updates that guarantee monotone ascent of the objective and global convergence to a stationary point (in the standard MM sense), avoiding approximate M-steps common in EM-type implementations. Second, we prove finite-sample rates for conditional density estimation and parameter recovery, and we adapt dendrograms of mixing measures to the classification setting to obtain a sweep-free selector of the number of experts that achieves near-parametric optimal rates after merging redundant fitted atoms. Experiments on biological protein–protein interaction prediction validate the full pipeline, delivering improved accuracy and better-calibrated probabilities than strong statistical and machine-learning baselines.
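For readers who want a concrete picture of the model class, the following is a minimal NumPy sketch of the conditional class probabilities and log-likelihood of a softmax-gated mixture of multinomial-logistic experts. It is an illustration under stated assumptions, not the authors' implementation: the affine parameterization of the gate and experts, the array shapes, and the names `sgmlmoe_predict_proba`, `log_likelihood`, `gate_W`, `gate_b`, `expert_W`, and `expert_b` are all hypothetical. It covers only the likelihood that maximum-likelihood training would target; the MM updates and the dendrogram-based selector are not shown.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def sgmlmoe_predict_proba(X, gate_W, gate_b, expert_W, expert_b):
    """Class probabilities p(y | x) for K experts and C classes.

    Assumed (hypothetical) shapes:
      X: (n, d), gate_W: (d, K), gate_b: (K,),
      expert_W: (K, d, C), expert_b: (K, C).
    """
    # Softmax gate: mixing weights over the K experts for each input.
    gate = softmax(X @ gate_W + gate_b, axis=-1)                 # (n, K)
    # Each expert is a multinomial-logistic classifier over C classes.
    logits = np.einsum("nd,kdc->nkc", X, expert_W) + expert_b    # (n, K, C)
    expert = softmax(logits, axis=-1)                            # (n, K, C)
    # Mixture: gate-weighted average of the expert class probabilities.
    return np.einsum("nk,nkc->nc", gate, expert)                 # (n, C)


def log_likelihood(X, y, gate_W, gate_b, expert_W, expert_b):
    """Conditional log-likelihood of integer labels y in {0, ..., C-1}."""
    proba = sgmlmoe_predict_proba(X, gate_W, gate_b, expert_W, expert_b)
    return np.log(proba[np.arange(len(y)), y]).sum()
```

An iterative fitting procedure with a monotone-ascent guarantee should never decrease the value returned by `log_likelihood` between iterations, which makes this function a convenient sanity check when experimenting with such a procedure.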
