Statistical Advantages of Perturbing Cosine Router in Mixture of Experts

Huy Nguyen, Pedram Akbarian, Trang Pham, Trang Nguyen, Shujian Zhang, Nhat Ho

TL;DR

This work analyzes estimation rates in a cosine-router Mixture of Experts (MoE) model and identifies a parameter interaction, formalized as a PDE, that can slow estimation to rates as poor as $\mathcal{O}_P(1/\log^{\tau}(n))$. To address this, it studies a perturbed cosine router, a stabilization technique already common in practice, which adds noise to the $\ell^2$-norms and thereby breaks the interaction while preserving the regression rate $\|g_{\tilde G_n}-g_{G_*}\|_{L^2(\mu)} = \mathcal{O}_P(\sqrt{\log(n)/n})$. Under a strong identifiability condition on the expert functions, the perturbed model achieves polynomial estimation rates for both router parameters and experts, with bounds ranging from $\mathcal{O}_P(\sqrt[4]{\log(n)/n})$ to $\mathcal{O}_P(\sqrt{\log(n)/n})$. Synthetic and real-data experiments (language modeling and domain generalization) corroborate the theory, showing faster convergence and better predictive performance than the vanilla cosine and linear routers. These results offer practical guidance on router design and identifiability considerations for improving sample efficiency in MoE deployments.
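To give a sense of the gap between the two kinds of rates, the sketch below evaluates $1/\log^{\tau}(n)$ and $\sqrt{\log(n)/n}$ at a few sample sizes. The choice $\tau = 1$, the sample sizes, and the omission of constants are assumptions made purely for illustration; the function names are hypothetical.

```python
import math

def slow_rate(n, tau=1.0):
    # Logarithmic rate 1 / log^tau(n); tau = 1 is an illustrative assumption.
    return 1.0 / math.log(n) ** tau

def fast_rate(n):
    # Parametric-type rate sqrt(log(n) / n), constants ignored.
    return math.sqrt(math.log(n) / n)

for n in (10**3, 10**6, 10**9):
    print(f"n={n:>10}: 1/log^tau(n)={slow_rate(n):.4f}  "
          f"sqrt(log(n)/n)={fast_rate(n):.6f}")
```

Under these assumptions, growing the sample from $10^3$ to $10^6$ only halves the logarithmic bound, while the polynomial bound shrinks by more than an order of magnitude.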

Abstract

The cosine router in Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and is better able to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potential. Despite its empirical success, a comprehensive analysis of the cosine router in MoE has been lacking. Considering the least squares estimation of the cosine routing MoE, we demonstrate that, due to the intrinsic interaction of the model parameters in the cosine router via some partial differential equations, regardless of the structures of the experts, the estimation rates of experts and model parameters can be as slow as $\mathcal{O}(1/\log^{\tau}(n))$, where $\tau > 0$ is some constant and $n$ is the sample size. Surprisingly, these pessimistic non-polynomial convergence rates can be circumvented by a technique widely used in practice to stabilize the cosine router: simply adding noise to the $\ell^2$-norms in the cosine router, which we refer to as the perturbed cosine router. Under the strongly identifiable settings of the expert functions, we prove that the estimation rates for both the experts and model parameters under the perturbed cosine routing MoE are significantly improved to polynomial rates. Finally, we conduct extensive simulation studies in both synthetic and real data settings to empirically validate our theoretical results.
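A minimal sketch of what "adding noise to the $\ell^2$-norms" could look like in a softmax gate follows. The exact parameterization is an assumption: the function name `cosine_router`, the bias term `b`, and the two perturbation constants `tau_w` and `tau_x` are hypothetical names chosen for illustration, not the authors' notation.

```python
import numpy as np

def cosine_router(x, W, b, tau_w=0.0, tau_x=0.0):
    """Softmax gate over cosine-similarity scores (illustrative sketch).

    x: input vector of shape (d,)
    W: router weights of shape (k, d), one row per expert
    b: bias of shape (k,)
    tau_w, tau_x: nonnegative constants added to the l2 norms
                  (both 0 gives the vanilla cosine router).
    """
    w_norms = np.linalg.norm(W, axis=1) + tau_w   # perturbed weight norms
    x_norm = np.linalg.norm(x) + tau_x            # perturbed input norm
    scores = (W @ x) / (w_norms * x_norm) + b
    scores = scores - scores.max()                # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

x = np.array([2.0, 1.0])
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.zeros(3)

vanilla = cosine_router(x, W, b)
vanilla_rescaled = cosine_router(x, 3.0 * W, b)
perturbed = cosine_router(x, W, b, tau_w=0.5, tau_x=0.5)
perturbed_rescaled = cosine_router(x, 3.0 * W, b, tau_w=0.5, tau_x=0.5)
```

In this sketch the vanilla gate is unchanged when the router weights are rescaled, whereas the perturbed gate is not. That scale invariance is one concrete way in which the vanilla router's parameters are coupled; whether it matches the exact interaction analyzed in the paper is not established by this example.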

Paper Structure (28 sections, 10 theorems, 121 equations, 2 figures, 5 tables)

Key Result

Theorem 1

Given the least-squares estimator $\widehat{G}_{n}$ defined in equation eq:least_squared_estimator, the regression estimator $f_{\widehat{G}_n}(\cdot)$ converges to the true regression function $f_{G_*}(\cdot)$ at the following rate: $\|f_{\widehat{G}_n}-f_{G_*}\|_{L^2(\mu)} = \mathcal{O}_P(\sqrt{\log(n)/n})$.

Figures (2)

  • Figure 1: Logarithmic plots displaying empirical convergence rates. Subfigures \ref{fig:exact_plot} and \ref{fig:over_plot} depict the empirical averages of the Voronoi losses $\mathcal{L}_3(\widehat{G}_n,G_*)$ (cf. equation \ref{eq:loss_perturbed_exact}) and $\mathcal{L}_2(\widehat{G}_n,G_*)$ (cf. equation \ref{eq:loss_perturbed_over}) for the exact and over-specified settings, respectively. The blue lines show the Voronoi loss for the perturbed router, and the green lines show the Voronoi loss for the standard cosine router. The red dash-dotted lines are the fitted lines used to determine the empirical convergence rate.
  • Figure 2: Log-log scaled plots displaying the empirical convergence rates. Figure \ref{fig:regime1_plot} depicts the empirical averages of the Voronoi losses when using the cosine router (green line) versus the perturbed cosine router (blue line). The red dash-dotted lines are the fitted lines used to determine the empirical convergence rates. Similarly, Figure \ref{fig:regime2_plot} depicts the empirical averages of the Voronoi losses when using the linear router (green line) versus the perturbed cosine router (blue line). The same data samples are used for these experiments.
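
The fitted lines mentioned in both captions can be obtained by a linear fit in log-log coordinates, where the slope estimates the exponent of the empirical convergence rate. The sketch below assumes ordinary least squares via `numpy.polyfit`; the helper name `empirical_rate` and the synthetic losses are made up for illustration and are not the paper's data or fitting procedure.

```python
import numpy as np

def empirical_rate(sample_sizes, losses):
    """Fit log(loss) = slope * log(n) + intercept and return both terms."""
    log_n = np.log(np.asarray(sample_sizes, dtype=float))
    log_loss = np.log(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(log_n, log_loss, deg=1)
    return slope, intercept

# Synthetic losses that decay exactly like n^(-1/2) (illustrative only).
n = np.array([1e3, 1e4, 1e5])
loss = 2.0 * n ** -0.5
slope, intercept = empirical_rate(n, loss)
```

For a loss that decays polynomially the fitted slope recovers the exponent. A loss that decays only logarithmically in $n$ would instead give a slope close to zero over the same range.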

Theorems & Definitions (20)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Definition 1: Strong identifiability
  • Theorem 4
  • Theorem 5
  • Definition 2: Weak identifiability
  • Theorem 6
  • Definition 3: $\varepsilon$-bracket
  • Definition 4: Bracketing number
  • ...and 10 more