Learning Mixtures of Experts with EM: A Mirror Descent Perspective
Quentin Fruytier, Aryan Mokhtari, Sujay Sanghavi
TL;DR
The paper connects EM for Mixtures of Experts (MoE) to projected Mirror Descent (MD) on the conditional log-likelihood: when the complete-data distribution lies in an exponential family, each EM update is equivalent to a single projected MD step with unit step size and a KL-divergence regularizer. This MD view yields convergence guarantees to stationary points and, under favorable initialization, linear convergence to the true parameters, with rates expressed through the Missing Information Matrix (MIM) and the signal-to-noise ratio (SNR). For symmetric two-expert MoEs (SymMoLinE and SymMoLogE), the authors further show that EM reduces to MD without projection and give explicit conditions for linear convergence. They extend the framework to deep and sparse MoE via a principled EM-like approach and validate the theory with synthetic and real-data experiments in which EM outperforms gradient-based methods in both convergence speed and accuracy. Overall, the work offers a unifying, theory-grounded perspective on training MoE with EM and points to practical benefits and future directions for scalable EM-based MoE optimization.
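For orientation, the generic projected Mirror Descent update that the EM step is identified with can be written as below. This is only a sketch of the standard MD template: the symbols $\theta^{t}$ (current parameters), $\Omega$ (feasible set), $\mathcal{L}$ (negative conditional log-likelihood), $\eta$ (step size) and $D$ (the KL-based regularizer between complete-data conditional distributions) are notation assumed here, and the exact argument order of the KL term follows the paper's definitions, which are not reproduced in this summary.

```latex
\theta^{t+1} \in \arg\min_{\theta \in \Omega}
  \Big\{ \big\langle \nabla \mathcal{L}(\theta^{t}),\, \theta - \theta^{t} \big\rangle
         + \tfrac{1}{\eta}\, D\big(\theta, \theta^{t}\big) \Big\},
\qquad \eta = 1 .
```

With $\eta = 1$ there is no step size to tune, which matches the fact that a plain EM iteration has no learning rate.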
Abstract
Classical Mixtures of Experts (MoE) are machine learning models that partition the input space and train a separate "expert" model on each partition. Recently, MoE-based model architectures have become popular as a means to reduce training and inference costs. There, the partitioning function and the experts are learnt jointly via gradient descent-type methods on the log-likelihood. In this paper we study theoretical guarantees of the Expectation Maximization (EM) algorithm for training MoE models. We first rigorously analyze EM for MoE where the conditional distribution of the target and latent variable given the feature variable belongs to an exponential family of distributions, and show its equivalence to projected Mirror Descent with unit step size and a Kullback-Leibler divergence regularizer. This perspective allows us to derive new convergence results and identify conditions for local linear convergence. In the special case of a mixture of $2$ linear or logistic experts, we additionally provide guarantees for linear convergence based on the signal-to-noise ratio. Experiments on synthetic and (small-scale) real-world data support that EM outperforms the gradient descent algorithm both in terms of convergence rate and the achieved accuracy.
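To make the E-step and M-step concrete, here is a minimal NumPy sketch of EM for a two-expert mixture of linear experts with a logistic gate. It is an illustration under assumptions, not the authors' implementation: the noise level `sigma` is treated as known, the gate M-step is only approximated by a few gradient-ascent steps (so this is a generalized EM), and all function names, hyperparameters and the synthetic data are hypothetical.

```python
import numpy as np

def log_joint(X, y, w, betas, sigma):
    """Return log p(y, z=k | x) for k = 0, 1 under a logistic gate and Gaussian linear experts."""
    u = X @ w
    log_gate = (-np.logaddexp(0.0, u), -np.logaddexp(0.0, -u))  # log P(z=0|x), log P(z=1|x)
    out = []
    for k in (0, 1):
        resid = (y - X @ betas[k]) / sigma
        log_dens = -0.5 * resid ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
        out.append(log_gate[k] + log_dens)
    return out

def log_likelihood(X, y, w, betas, sigma):
    """Conditional log-likelihood of y given X, summed over samples."""
    lj0, lj1 = log_joint(X, y, w, betas, sigma)
    return float(np.sum(np.logaddexp(lj0, lj1)))

def em_step(X, y, w, betas, sigma, gate_steps=20, gate_lr=0.5):
    """One (generalized) EM iteration; returns updated gate, experts and responsibilities."""
    n, d = X.shape
    # E-step: posterior probability that expert 1 generated each sample.
    lj0, lj1 = log_joint(X, y, w, betas, sigma)
    r = np.exp(lj1 - np.logaddexp(lj0, lj1))
    # M-step (experts): responsibility-weighted least squares, tiny ridge for stability.
    new_betas = []
    for resp in (1.0 - r, r):
        Xw = X * resp[:, None]
        new_betas.append(np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(d), Xw.T @ y))
    # M-step (gate): a few gradient-ascent steps on the concave weighted logistic objective.
    new_w = w.copy()
    for _ in range(gate_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ new_w)))
        new_w = new_w + gate_lr * (X.T @ (r - p)) / n
    return new_w, new_betas, r

# Synthetic data (assumed setup): symmetric experts +beta_star and -beta_star.
rng = np.random.default_rng(0)
n, d, sigma = 500, 3, 0.5
X = rng.standard_normal((n, d))
w_star = np.array([2.0, -1.0, 0.5])
beta_star = np.array([1.0, 1.5, -1.0])
z = rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ w_star)))
y = np.where(z, X @ beta_star, -(X @ beta_star)) + sigma * rng.standard_normal(n)

# Random initialization, then iterate EM while recording the log-likelihood.
w = np.zeros(d)
betas = [0.1 * rng.standard_normal(d), 0.1 * rng.standard_normal(d)]
history = [log_likelihood(X, y, w, betas, sigma)]
for _ in range(30):
    w, betas, resp = em_step(X, y, w, betas, sigma)
    history.append(log_likelihood(X, y, w, betas, sigma))
```

Tracking `history` is a cheap sanity check: because each M-step here does not decrease the expected complete-data log-likelihood, the recorded values should be non-decreasing up to floating-point error.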
