Fast Model Selection and Stable Optimization for Softmax-Gated Multinomial-Logistic Mixture of Experts Models

TrungKhang Tran; TrungTin Nguyen; Md Abul Bashar; Nhat Ho; Richi Nayak; Christopher Drovandi

TL;DR

This work addresses stable maximum-likelihood training and principled model selection for softmax-gated multinomial-logistic MoE classifiers (SGMLMoE) in the full-data regime. It develops a batch MM algorithm with an explicit quadratic surrogate that yields coordinate-wise closed-form updates, guarantees monotone improvement of the objective, and converges to a stationary point, avoiding the approximate inner M-steps common in EM-type implementations. The authors introduce a dendrogram-based, sweep-free model selection pipeline grounded in a Voronoi loss over mixing measures, and establish finite-sample rates for density estimation and parameter recovery, adapting mixing-measure aggregation to the multiclass SGMLMoE setting. Empirical results on synthetic data and a protein–protein interaction task show improved accuracy and probability calibration with a compact, interpretable expert structure, highlighting the method's stability and practical utility for heterogeneous classification tasks.

Abstract

Mixture-of-Experts (MoE) architectures combine specialized predictors through a learned gate and are effective across regression and classification, but for classification with softmax multinomial-logistic gating, rigorous guarantees for stable maximum-likelihood training and principled model selection remain limited. We address both issues in the full-data (batch) regime. First, we derive a batch minorization-maximization (MM) algorithm for softmax-gated multinomial-logistic MoE using an explicit quadratic minorizer, yielding coordinate-wise closed-form updates that guarantee monotone ascent of the objective and global convergence to a stationary point (in the standard MM sense), avoiding approximate M-steps common in EM-type implementations. Second, we prove finite-sample rates for conditional density estimation and parameter recovery, and we adapt dendrograms of mixing measures to the classification setting to obtain a sweep-free selector of the number of experts that achieves near-parametric optimal rates after merging redundant fitted atoms. Experiments on biological protein–protein interaction prediction validate the full pipeline, delivering improved accuracy and better-calibrated probabilities than strong statistical and machine-learning baselines.
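To make the model class concrete, the following is a minimal sketch of how a softmax-gated multinomial-logistic MoE could turn inputs into class probabilities: a softmax gate weights $K$ experts, and each expert is itself a multinomial-logistic (softmax) classifier over $M$ classes. The function name `sgmlmoe_predict_proba`, the parameter shapes, and the affine parameterization of gate and experts are assumptions made for illustration; the paper's exact parameterization (for example, baseline-class constraints) may differ.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sgmlmoe_predict_proba(X, gate_W, gate_b, expert_W, expert_b):
    """Hypothetical SGMLMoE forward pass (illustrative shapes).

    X: (N, D) inputs; gate_W: (K, D); gate_b: (K,);
    expert_W: (K, M, D); expert_b: (K, M).
    Returns an (N, M) array of class probabilities.
    """
    gate = softmax(X @ gate_W.T + gate_b)                      # (N, K) gating weights
    logits = np.einsum("nd,kmd->nkm", X, expert_W) + expert_b  # (N, K, M) expert logits
    expert = softmax(logits, axis=-1)                          # (N, K, M) per-expert class probs
    return np.einsum("nk,nkm->nm", gate, expert)               # mix experts with the gate

rng = np.random.default_rng(0)
N, D, K, M = 5, 3, 4, 2
X = rng.normal(size=(N, D))
proba = sgmlmoe_predict_proba(
    X,
    rng.normal(size=(K, D)), rng.normal(size=K),
    rng.normal(size=(K, M, D)), rng.normal(size=(K, M)),
)
```

Because the gate weights and each expert's class probabilities are both normalized, every row of `proba` is a valid probability vector over the $M$ classes.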

Paper Structure

This paper contains 52 sections, 9 theorems, 125 equations, 6 figures, 3 tables, 2 algorithms.

Introduction
Mixture of Experts Models
Stable Maximum-likelihood Training via MM in the Full-data Regime
Model Selection in SGMLMoE Models
Contributions and Paper Organization
Preliminaries
MM Algorithm for SGMLMoE Models
Fast-rate-aware Aggregation and Sweep-free Model Selection for SGMLMoE
The "Rate Gap" under Overfitting
Mixing-measure Representation for SGMLMoE
Voronoi Cells and Losses for SGMLMoE
Why Merge Experts? The Dendrogram Viewpoint
Finite-sample Rates along the Dendrogram
Likelihood along the Path and Sweep-free Selection
Numerical Experiment
...and 37 more sections

Key Result

Theorem 1

The following bound defines a batch MM surrogate for $-\mathcal{L}({\bm{\theta}})$, with proof deferred to sec:Surrogate_SGMLMoE: Here $C^{(t)}$ is a constant independent of ${\bm{\theta}}$, and $\mathcal{S}_1({\bm{\theta}},{\bm{\theta}}^{(t)})=\sum_{n=1}^{N}\left\{g_n({\bm{w}}^{(t)})+\bar{{\bm{w}}}_t^{\top}\nabla g_n({\bm{w}}^{(t)})+\frac{1}{2}\bar{{\bm{w}}}_t^{\top}{\bm{B}}_{n,K}\bar{{\bm{w}}}_t\right\}$,
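The quadratic form in $\mathcal{S}_1$ has the shape of a second-order expansion whose Hessian is replaced by a fixed curvature matrix. The sketch below numerically checks the classical Böhning bound for a baseline-softmax log-partition function, which is one standard way to obtain such a matrix. It is an illustration under assumptions, not the paper's construction: the exact form of ${\bm{B}}_{n,K}$ is not given in this excerpt, $\bar{{\bm{w}}}_t$ is assumed to denote the increment ${\bm{w}}-{\bm{w}}^{(t)}$, and the names `lse_baseline`, `grad_hess`, and `bohning_matrix` are hypothetical.

```python
import numpy as np

def lse_baseline(w):
    # g(w) = log(1 + sum_j exp(w_j)): log-partition with the last class as baseline.
    return np.logaddexp.reduce(np.concatenate([w, [0.0]]))

def grad_hess(w):
    # Gradient and Hessian of g at w, expressed through the non-baseline probabilities p.
    z = np.concatenate([w, [0.0]])
    p = np.exp(z - np.logaddexp.reduce(z))[:-1]
    return p, np.diag(p) - np.outer(p, p)

def bohning_matrix(m):
    # Classical Bohning curvature bound for m free logits (m + 1 classes).
    return 0.5 * (np.eye(m) - np.ones((m, m)) / (m + 1))

rng = np.random.default_rng(0)
m = 3
B = bohning_matrix(m)
min_eig_gap = np.inf     # smallest eigenvalue of B - H seen (should be >= 0)
max_violation = -np.inf  # largest value of g(w) minus its quadratic upper bound (should be <= 0)
for _ in range(200):
    w0 = rng.normal(scale=3.0, size=m)
    w = rng.normal(scale=3.0, size=m)
    grad, H = grad_hess(w0)
    min_eig_gap = min(min_eig_gap, np.linalg.eigvalsh(B - H).min())
    d = w - w0
    upper = lse_baseline(w0) + grad @ d + 0.5 * d @ B @ d
    max_violation = max(max_violation, lse_baseline(w) - upper)
```

Since the curvature matrix dominates the Hessian everywhere, the quadratic expansion around any point lies above the function globally, which is what makes it usable as an MM surrogate for a negative log-likelihood.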

Figures (6)

Figure 1: Illustration of the DSC merging path on an over-specified fit ($K=4, M = 2$): successive merges remove near-duplicate experts and recover the true expert structure ($K_0 = 2, M = 2$).
Figure 2: Convergence behavior of the Batch MM algorithm (parameter error versus sample size).
Figure 3: Empirical convergence of the Voronoi loss under exact specification and over-specification.
Figure 4: DSC model selection performance.
Figure 5: Dendrogram of mixing measure.
...and 1 more figure
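The merging path illustrated in Figure 1 can be pictured with a generic agglomerative step on a discrete mixing measure: find the two closest atoms, then replace them by a single atom carrying their combined weight. The sketch below is only an assumption-laden illustration. The dissimilarity (a weight-scaled squared distance), the weighted-average merge rule, and the function name `merge_closest_pair` are hypothetical choices, and the toy atoms stand in for whatever gate and expert parameters the paper's DSC procedure actually merges.

```python
import numpy as np

def merge_closest_pair(weights, atoms):
    """One hypothetical merge step: returns (new_weights, new_atoms, height)."""
    K = len(weights)
    best = None
    for i in range(K):
        for j in range(i + 1, K):
            wi, wj = weights[i], weights[j]
            # Assumed dissimilarity: harmonic-type weight factor times squared distance.
            height = wi * wj / (wi + wj) * np.sum((atoms[i] - atoms[j]) ** 2)
            if best is None or height < best[0]:
                best = (height, i, j)
    height, i, j = best
    w_new = weights[i] + weights[j]
    a_new = (weights[i] * atoms[i] + weights[j] * atoms[j]) / w_new  # weighted average
    keep = [k for k in range(K) if k not in (i, j)]
    return np.append(weights[keep], w_new), np.vstack([atoms[keep], a_new]), height

# Toy over-specified fit: 4 atoms forming two near-duplicate pairs.
weights = np.array([0.30, 0.20, 0.25, 0.25])
atoms = np.array([[1.00, 0.00], [1.05, -0.02], [-1.00, 0.50], [-0.97, 0.52]])

heights = []
while len(weights) > 2:  # the stopping level is fixed here purely for the demo
    weights, atoms, h = merge_closest_pair(weights, atoms)
    heights.append(h)
```

In this toy run the two near-duplicate pairs collapse first, leaving two atoms of equal weight. The stopping level is hard-coded here; the paper instead selects it without sweeping over candidate model sizes.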

Theorems & Definitions (11)

Theorem 1: Surrogate for the batch negative log-likelihood.
Theorem 2: MM monotonicity for batch SGMLMoE
Theorem 3
Theorem 4: Voronoi monotonicity along the merge chain
Theorem 5: Voronoi and height rates along the path
Theorem 6: Likelihood control along the dendrogram
Theorem 7: Consistency of DSC for SGMLMoE
Lemma 1: Kronecker product preserves Loewner order
Proof: Proof of lem:prop:bohning_bound_baseline
Proposition 1: Uniform bound for baseline-softmax Hessian
...and 1 more
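The monotonicity property named in Theorem 2 can be illustrated on a much simpler model than SGMLMoE. The sketch below runs an MM iteration for a single baseline-softmax regression (one expert, no gate), using the classical Böhning curvature bound combined with a Kronecker-type structure to obtain a closed-form update, and records the negative log-likelihood at each step. This is a simplified stand-in under assumptions, not the paper's batch algorithm: the function names, the synthetic data, and the full-matrix (rather than coordinate-wise) update are all illustrative choices.

```python
import numpy as np

def nll_and_grad(W, X, y):
    # Baseline-softmax regression: the last class has its logit fixed at 0.
    N = X.shape[0]
    z = np.hstack([X @ W.T, np.zeros((N, 1))])      # (N, M) logits
    log_norm = np.logaddexp.reduce(z, axis=1)       # (N,) log-partition values
    nll = float(np.sum(log_norm - z[np.arange(N), y]))
    P = np.exp(z - log_norm[:, None])[:, :-1]       # non-baseline probabilities
    Y = np.eye(z.shape[1])[y][:, :-1]               # one-hot labels, baseline column dropped
    return nll, (P - Y).T @ X                       # gradient has shape (M - 1, D)

def mm_softmax_regression(X, y, n_classes, n_iter=50):
    m, D = n_classes - 1, X.shape[1]
    # Inverse of the Bohning matrix 0.5 * (I - 11^T / (m + 1)) is 2 * (I + 11^T).
    B_inv = 2.0 * (np.eye(m) + np.ones((m, m)))
    S_inv = np.linalg.inv(X.T @ X)
    W = np.zeros((m, D))
    history = []
    for _ in range(n_iter):
        nll, G = nll_and_grad(W, X, y)
        history.append(nll)
        # Closed-form minimizer of the quadratic majorizer around the current W.
        W = W - B_inv @ G @ S_inv
    history.append(nll_and_grad(W, X, y)[0])
    return W, history

# Synthetic data drawn from a softmax model (illustrative only).
rng = np.random.default_rng(1)
N, D, M = 200, 3, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D - 1))])
W_true = rng.normal(size=(M - 1, D))
z_true = np.hstack([X @ W_true.T, np.zeros((N, 1))])
P_true = np.exp(z_true - np.logaddexp.reduce(z_true, axis=1)[:, None])
y = np.array([rng.choice(M, p=p) for p in P_true])

W_hat, history = mm_softmax_regression(X, y, M)
```

The recorded `history` should never increase from one iteration to the next, because each update minimizes a function that lies above the negative log-likelihood and touches it at the current iterate.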
