A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts

Huy Nguyen, Pedram Akbarian, TrungTin Nguyen, Nhat Ho

TL;DR

This work analyzes the convergence of density estimation and parameter estimation for softmax gating multinomial logistic MoEs in the classification setting. It reveals an interaction between the gating and expert parameters, expressed through partial differential equations (PDEs), that can slow parameter estimation below any polynomial rate when some expert parameters vanish. To capture these rates, the authors introduce a Voronoi-based loss $\mathcal{D}_r$ and establish density-estimation rates of order $\widetilde{\mathcal{O}}(n^{-1/2})$ under both the standard and the modified gating. For the standard gate, Regime 1 yields parameter-estimation rates of $\widetilde{\mathcal{O}}(n^{-1/2})$ for exactly-specified parameters and $\widetilde{\mathcal{O}}(n^{-1/4})$ for over-specified parameters. Regime 2 yields a minimax lower bound of order $n^{-1/2}$ on the loss, which allows slower-than-polynomial rates for over-specified parameters because of the PDE interaction. To overcome this limitation, the authors propose a modified softmax gate that applies a transformation $M(X)$ to the input before gating. The transformation removes the gating-expert interaction while preserving identifiability, giving parametric density-estimation rates and improved parameter-estimation rates regardless of expert collapse.

Abstract

The mixture-of-experts (MoE) model incorporates the power of multiple submodels via gating functions to achieve greater performance in numerous regression and classification applications. From a theoretical perspective, while there have been previous attempts to comprehend the behavior of that model in the regression setting through the convergence analysis of maximum likelihood estimation in the Gaussian MoE model, such analysis in the classification setting has remained missing from the literature. We close this gap by establishing the convergence rates of density estimation and parameter estimation in the softmax gating multinomial logistic MoE model. Notably, when part of the expert parameters vanish, these rates are shown to be slower than polynomial rates owing to an inherent interaction between the softmax gating and expert functions via partial differential equations. To address this issue, we propose a novel class of modified softmax gating functions which transform the input before passing it to the gating functions. As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.
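To make the model in the abstract concrete, the following is a minimal NumPy sketch of the conditional class probabilities of a softmax gating multinomial logistic MoE, with an optional input transformation for the modified gate. The function name `moe_class_probs`, the parameter names (`beta1`, `beta0`, `a`, `b`), the array shapes, and the elementwise form of `M` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def moe_class_probs(X, beta1, beta0, a, b, M=None):
    """Class probabilities P(Y = s | X) for a k-expert, K-class MoE.

    Assumed (hypothetical) shapes:
      X: (n, d) inputs; beta1: (k, d), beta0: (k,) gating parameters;
      a: (k, K), b: (k, K, d) expert intercepts and slopes.
    M: optional shape-preserving transform applied to X before the gate
       (the modified gate); the experts always see the raw X.
    """
    gate_in = X if M is None else M(X)
    gate = softmax(gate_in @ beta1.T + beta0, axis=1)               # (n, k)
    expert_logits = a[None, :, :] + np.einsum("nd,kcd->nkc", X, b)  # (n, k, K)
    expert = softmax(expert_logits, axis=2)                         # (n, k, K)
    return np.einsum("nk,nkc->nc", gate, expert)                    # (n, K)

# Small demo on random parameters (synthetic values, for illustration only).
rng = np.random.default_rng(0)
n, d, k, K = 5, 2, 3, 4
X = rng.normal(size=(n, d))
beta1, beta0 = rng.normal(size=(k, d)), rng.normal(size=k)
a, b = rng.normal(size=(k, K)), rng.normal(size=(k, K, d))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
p_standard = moe_class_probs(X, beta1, beta0, a, b)             # gate sees X
p_modified = moe_class_probs(X, beta1, beta0, a, b, M=sigmoid)  # gate sees M(X)
```

In this sketch the only difference between the two gates is the input to the gating softmax, which reflects the abstract's description of transforming the input before it reaches the gating functions.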

Paper Structure

This paper contains 29 sections, 10 theorems, 142 equations, 3 figures, and 3 tables.

Key Result

Proposition 2.1

Given two mixing measures $G$ and $G'$ in $\mathcal{O}_k(\Theta)$, if $g_{G}(Y|X)=g_{G'}(Y|X)$ holds for almost every $(X,Y)$, then $G\equiv G'$.
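This summary does not define $g_G$. As a reading aid, and assuming the standard form of a softmax gating multinomial logistic MoE with $k$ experts and $K$ classes, the conditional density can be written as below. The symbols $\beta_{1i},\beta_{0i}$ (gating parameters) and $a_{is},b_{is}$ (expert parameters) are assumed notation and may differ from the paper's.

$$
g_{G}(Y=s\mid X)=\sum_{i=1}^{k}\frac{\exp(\beta_{1i}^{\top}X+\beta_{0i})}{\sum_{j=1}^{k}\exp(\beta_{1j}^{\top}X+\beta_{0j})}\cdot\frac{\exp(a_{is}+b_{is}^{\top}X)}{\sum_{\ell=1}^{K}\exp(a_{i\ell}+b_{i\ell}^{\top}X)},\qquad s\in\{1,\dots,K\}.
$$

Under this reading, identifiability says that two parameter configurations producing the same conditional density for almost every $(X,Y)$ must correspond to the same mixing measure.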

Figures (3)

  • Figure 1: Log-log plots of the empirical convergence rates of the MLE $\widehat{G}_n$ when the true model in equation \ref{eq:new_density} is over-specified by a softmax gating binomial logistic mixture of $k=3$ and $k=4$ experts, respectively. The blue curves show the empirical means of the discrepancy $\mathcal{D}_2(\widehat{G}_n,G_*)$, and the orange dash-dotted lines show the least-squares fitted regression lines.
  • Figure 2: Log-log plots of the empirical convergence rates of the MLE $\widehat{G}_n$ when the true model in equation \ref{eq:new_density2} is over-specified by a softmax gating binomial logistic mixture with $M(X) = X$ and $M(X) = \mathrm{sigmoid}(X)$ of $k=3$ and $k=4$ experts, respectively. The blue curves show the empirical means of the discrepancy $\mathcal{D}_2(\widehat{G}_n,G_*)$, and the orange dash-dotted lines show the least-squares fitted regression lines.
  • Figure 3: Empirical convergence rates of the EM algorithm with the standard softmax gating function and three different modified softmax gating functions with $M(X)\in\{\sin(X),\cos(X),\log(|X|)\}$. The y-axis indicates the negative log-likelihood, while the x-axis illustrates the number of EM iterations.
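The fitted lines described in the captions of Figures 1 and 2 amount to a least-squares regression of the log discrepancy on the log sample size, whose slope estimates the empirical rate exponent. A minimal sketch follows. The helper name `fitted_rate` is hypothetical, and the data are synthetic values generated from an exact $n^{-1/2}$ law, not results from the paper.

```python
import numpy as np

def fitted_rate(sample_sizes, mean_losses):
    """Least-squares line through (log n, log loss).

    Returns (slope, intercept); a slope near -0.5 would correspond to an
    empirical rate of roughly n^{-1/2}.
    """
    log_n = np.log(np.asarray(sample_sizes, dtype=float))
    log_loss = np.log(np.asarray(mean_losses, dtype=float))
    slope, intercept = np.polyfit(log_n, log_loss, deg=1)
    return slope, intercept

# Synthetic check: losses that decay exactly as 3 * n^{-1/2}.
sizes = np.array([100, 1_000, 10_000, 100_000])
losses = 3.0 * sizes ** -0.5
slope, intercept = fitted_rate(sizes, losses)
```

On real experiments the inputs would be the empirical means of $\mathcal{D}_2(\widehat{G}_n,G_*)$ at each sample size, and the slope would only approximate the theoretical exponent.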

Theorems & Definitions (12)

  • Proposition 2.1: Identifiability
  • Proposition 2.2: Density Estimation Rate
  • Theorem 3.1: Parameter Estimation Rate
  • Proposition 3.2
  • Theorem 3.3: Minimax Lower Bound
  • Definition 4.1: Modified Function
  • Proposition 4.2: Identifiability
  • Proposition 4.3: Density Estimation Rate
  • Theorem 4.4: Parameter Estimation Rate
  • Lemma 2.1
  • ...and 2 more