Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts

Huy Nguyen, Nhat Ho, Alessandro Rinaldo

TL;DR

This paper verifies theoretically that sigmoid gating enjoys higher sample efficiency than softmax gating for the statistical task of expert estimation, and demonstrates that experts formulated as feed-forward networks with commonly used activations such as ReLU and GELU enjoy faster convergence rates under sigmoid gating than under softmax gating.

Abstract

The softmax gating function is arguably the most popular choice in mixture of experts modeling. Despite its widespread use in practice, the softmax gating may lead to unnecessary competition among experts, potentially causing the undesirable phenomenon of representation collapse due to its inherent structure. In response, the sigmoid gating function has been recently proposed as an alternative and has been demonstrated empirically to achieve superior performance. However, a rigorous examination of the sigmoid gating function is lacking in the current literature. In this paper, we verify theoretically that the sigmoid gating, in fact, enjoys a higher sample efficiency than the softmax gating for the statistical task of expert estimation. Towards that goal, we consider a regression framework in which the unknown regression function is modeled as a mixture of experts, and study the rates of convergence of the least squares estimator under the over-specified case in which the number of fitted experts is larger than the true value. We show that two gating regimes naturally arise and, in each of them, we formulate an identifiability condition for the expert functions and derive the corresponding convergence rates. In both cases, we find that experts formulated as feed-forward networks with commonly used activations such as ReLU and GELU enjoy faster convergence rates under the sigmoid gating than under the softmax gating. Furthermore, given the same choice of experts, we demonstrate that the sigmoid gating function requires a smaller sample size than its softmax counterpart to attain the same error of expert estimation and, therefore, is more sample efficient.
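
The contrast the abstract draws between the two gates can be illustrated with a minimal sketch. Everything named here is an assumption for illustration rather than the paper's notation or code: scalar inputs, affine gate scores `w * x + c`, one-neuron ReLU experts, and the helper names `softmax_gate`, `sigmoid_gate`, `moe_predict`, and `least_squares_loss`. It shows that softmax weights are coupled through a shared normalizer, so raising one expert's score lowers the others, whereas each sigmoid weight depends only on its own score.

```python
import math


def softmax_gate(scores):
    """Softmax gating: weights are coupled and always sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def _sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def sigmoid_gate(scores):
    """Sigmoid gating: each weight depends only on its own score."""
    return [_sigmoid(s) for s in scores]


def relu_expert(x, a, b):
    """Hypothetical one-neuron ReLU expert: max(0, a * x + b)."""
    return max(0.0, a * x + b)


def moe_predict(x, gate_params, expert_params, gate_fn):
    """Mixture-of-experts output: gate-weighted sum of expert outputs."""
    scores = [w * x + c for (w, c) in gate_params]
    weights = gate_fn(scores)
    return sum(g * relu_expert(x, a, b)
               for g, (a, b) in zip(weights, expert_params))


def least_squares_loss(data, gate_params, expert_params, gate_fn):
    """Mean squared error over (x, y) pairs; the objective a least
    squares estimator would minimize over the parameters."""
    return sum((y - moe_predict(x, gate_params, expert_params, gate_fn)) ** 2
               for x, y in data) / len(data)


if __name__ == "__main__":
    before, after = [2.0, 1.0, -1.0], [5.0, 1.0, -1.0]
    # Only the first expert's score changes between the two calls.
    print("softmax:", softmax_gate(before), "->", softmax_gate(after))
    print("sigmoid:", sigmoid_gate(before), "->", sigmoid_gate(after))
```

In the printed comparison, the second and third softmax weights shrink when the first score grows, while the corresponding sigmoid weights stay the same. This is only a picture of the "competition among experts" mentioned above; it does not reproduce the paper's convergence analysis.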

Paper Structure

This paper contains 15 sections, 9 theorems, 95 equations, 2 figures, and 4 tables.

Key Result

Theorem 1

Under Regime 1, with the least squares estimator $\widehat{G}_n$ defined in equation \ref{eq:least_squared_estimator}, the regression estimator $f_{\widehat{G}_n}$ admits the following rate of convergence to $f_{G_*}$:

Figures (2)

  • Figure 1: Log-log scaled plots displaying the empirical averages of the Voronoi losses when using the sigmoid gating (blue line) versus when using the softmax gating (green line) under the same data. The red dash-dotted lines illustrate the fitted lines for determining the empirical convergence rates.
  • Figure 2: Logarithmic plots of empirical convergence rates. Figures \ref{fig:regime1_plot} and \ref{fig:regime2_plot} illustrate the empirical averages of the corresponding Voronoi losses under Regime 1 and Regime 2, respectively. The blue lines depict the Voronoi loss associated with the $\mathrm{ReLU}$ experts, while the green lines correspond to that of the linear experts. The red dash-dotted lines illustrate the fitted lines for determining the empirical convergence rates. See Appendix \ref{appendix:experiments} for the experimental details.

Theorems & Definitions (13)

  • Theorem 1
  • Corollary 1
  • Definition 1: Strong identifiability
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Definition 2: Weak identifiability
  • Theorem 5
  • Lemma 1
  • Lemma 2
  • ...and 3 more