Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts

Huy Nguyen; Pedram Akbarian; Fanqi Yan; Nhat Ho

TL;DR

When the true number of experts is unknown and the model is over-specified, the number of experts selected by the top-K sparse softmax gating function must exceed the total cardinality of a certain number of Voronoi cells associated with the true parameters to guarantee convergence of the density estimation.

Abstract

The top-K sparse softmax gating mixture of experts has been widely used to scale up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of this gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitions the input space into multiple regions with distinct behaviors. By focusing on a Gaussian mixture of experts, we establish theoretical results on the effects of the top-K sparse softmax gating function on both density and parameter estimation. Our results hinge upon defining novel loss functions among parameters to capture the different behaviors of the input regions. When the true number of experts $k_{\ast}$ is known, we demonstrate that the convergence rates of density and parameter estimation are both parametric in the sample size. However, when $k_{\ast}$ is unknown and the true model is over-specified by a Gaussian mixture of $k$ experts where $k > k_{\ast}$, our findings suggest that the number of experts selected by the top-K sparse softmax gating function must exceed the total cardinality of a certain number of Voronoi cells associated with the true parameters to guarantee the convergence of the density estimation. Moreover, while the density estimation rate remains parametric under this setting, the parameter estimation rates become substantially slower due to an intrinsic interaction between the softmax gating and expert functions.
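To make the gating mechanism concrete, the following is a minimal sketch of a top-K sparse softmax gate. It assumes the gate scores are the linear functions $(\beta_{1i})^{\top}x$ and omits any bias terms; the function names `top_k_sparse_softmax` and `gate` and the tie-breaking rule are illustrative choices, not taken from the paper.

```python
import math

def top_k_sparse_softmax(scores, k):
    """Softmax restricted to the k largest scores; all other weights are zero."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest scores (assumption: ties go to the lower index).
    top = sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]
    # Subtract the maximum selected score for numerical stability.
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(scores))]

def gate(x, beta1, k):
    """Gating weights at input x, with one parameter vector per expert in beta1."""
    scores = [sum(b * xi for b, xi in zip(row, x)) for row in beta1]
    return top_k_sparse_softmax(scores, k)

# Three experts, two of them active.
weights = top_k_sparse_softmax([2.0, 1.0, 0.0], k=2)
```

Only the selected experts receive non-zero weight, so which experts are active changes from one region of the input space to another. This is the piecewise structure the abstract identifies as the main difficulty.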

Paper Structure

This paper contains 27 sections, 12 theorems, 177 equations, 3 figures, and 1 table.

Introduction
Exact-specified Settings
Over-specified Settings
Practical Implications
Conclusion and Future Directions
Proof for Results under the Exact-specified Settings
Proof of Theorem \ref{theorem:exact_fitted_density}
Main Proof
Proof of Lemma \ref{lemma:covering_bracketing_bound}
Proof of Theorem \ref{theorem:exact_fitted_MLE}
Proof of Lemma \ref{lemma:partition_exact}
Proof for Results under Over-specified Settings
Proof of Theorem \ref{theorem:over_fitted_density}
Proof of Theorem \ref{theorem:over_fitted_MLE}
Proof of Proposition \ref{prop:K_bar_bound}
...and 12 more sections

Key Result

Lemma 1

For any $i\in[k_*]$, let $\beta_{1i},\beta^*_{1i}\in\mathbb{R}^{d}$ be such that $\|\beta_{1i}-\beta^*_{1i}\|\leq \eta_i$ for some sufficiently small $\eta_i>0$. Then, for any $\ell\in[q]$, unless the set $\mathcal{X}^*_{\ell}$ has measure zero, we obtain that $\mathcal{X}^*_{\ell}= \mathcal{X}_{\ell}$.
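The regions $\mathcal{X}_{\ell}$ are not defined in this excerpt. The sketch below assumes, in line with the Figure 1 caption, that each region collects the inputs whose $K$ largest gate scores $(\beta_{1i})^{\top}x$ come from the same set of experts. The helper `top_k_region` and the parameter values are hypothetical and serve only to show how an input is assigned to a region.

```python
def top_k_region(x, beta1, k):
    """Return the sorted tuple of expert indices with the k largest scores at x."""
    scores = [sum(b * xi for b, xi in zip(row, x)) for row in beta1]
    # Rank experts by decreasing score (assumption: ties go to the lower index).
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return tuple(sorted(order[:k]))

# Hypothetical gating parameters for k_* = 3 experts in d = 2 dimensions.
beta1 = [(1.0, 0.0), (-0.5, 0.9), (-0.5, -0.9)]
points = [(1.0, 0.1), (-0.2, 1.0), (-0.3, -1.0)]

# With K = 1, each input is labelled by its single highest-scoring expert.
labels_top1 = [top_k_region(p, beta1, 1) for p in points]

# With K = 2, regions are indexed by unordered pairs of experts.
labels_top2 = [top_k_region(p, beta1, 2) for p in points]
```

Under this reading, the number of possible regions is the number of $K$-element subsets of the experts, which would correspond to the index range $[q]$ in the lemma; that correspondence is an assumption here.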

Figures (3)

Figure 1: Illustration of two partitions of the input space with respect to the $\mathrm{TopK}$ function in the density estimation $g_{\widehat{G}_n}$ (left) and the true density $g_{G_*}$ (right) under the exact-specified settings when $k_*=3$ and $K=1$. Here, the regions labelled $\boldsymbol{1_n}$ and $\boldsymbol{1^*}$ contain the points $X\in\mathcal{X}$ for which $(\widehat{\beta}^n_{11})^{\top}X$ and $(\beta^*_{11})^{\top}X$ are the top-1 elements of $((\widehat{\beta}^n_{1i})^{\top}X)_{i=1}^{3}$ and $((\beta^*_{1i})^{\top}X)_{i=1}^{3}$, respectively. The other regions are defined similarly. If $\widehat{\beta}^n_{1i}\to\beta^*_{1i}$ as $n\to\infty$ for every $i\in\{1,2,3\}$, then the regions $\boldsymbol{1_n},\boldsymbol{2_n},\boldsymbol{3_n}$ should respectively match their counterparts $\boldsymbol{1^*},\boldsymbol{2^*},\boldsymbol{3^*}$ to guarantee the convergence of $g_{\widehat{G}_n}$ to $g_{G_*}$. Lemma \ref{lemma:partition_exact} states that this property holds when the sample size $n$ is sufficiently large.
Figure 2: A visual representation showcasing the relationship between $X$ and $Y$, along with their respective marginal distributions when $K = 1$ and $K = 2$.
Figure 3: Log-log scaled plots illustrating simulation results under the exact-specified and the over-specified settings. We analyze the MLE $\widehat{G}_n$ across 40 independent samples, spanning sample sizes from $10^2$ to $10^5$. The blue curves depict the mean discrepancy between the MLE $\widehat{G}_n$ and the true mixing measure $G_*$, accompanied by error bars signifying two empirical standard deviations under the exact-specified settings. Additionally, an orange dash-dotted line represents the least-squares fitted linear regression line for these data points.
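The fitted line described in the Figure 3 caption can be reproduced in spirit with an ordinary least-squares fit on log-transformed data, whose slope estimates the empirical convergence exponent. The sketch below uses synthetic errors that decay like $n^{-1/2}$; these values are illustrative and are not the paper's simulation results, and the function name `loglog_slope` is an assumption.

```python
import math

def loglog_slope(sample_sizes, errors):
    """Least-squares fit of log(error) = intercept + slope * log(n)."""
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic errors decaying exactly like 3 * n^{-1/2} (not the paper's data).
sizes = [10**2, 10**3, 10**4, 10**5]
errs = [3.0 * n ** -0.5 for n in sizes]
slope, intercept = loglog_slope(sizes, errs)
```

A slope near $-1/2$ on such a plot would be consistent with a parametric rate, while a slope closer to zero would indicate slower convergence.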

Theorems & Definitions (15)

Lemma 1
Theorem 1: Density estimation rate
Theorem 2: Parameter estimation rate
Proposition 1
Lemma 2
Theorem 3: Density estimation rate
Theorem 4: Parameter estimation rate
Lemma 3
Lemma 4: Theorem 7.4, van de Geer (2000)
Proposition 2: Identifiability
...and 5 more
