On Least Square Estimation in Softmax Gating Mixture of Experts

Huy Nguyen, Nhat Ho, Alessandro Rinaldo

TL;DR

This work investigates the performance of the least squares estimators (LSE) under a deterministic MoE model where the data are sampled according to a regression model, a setting that has remained largely unexplored.

Abstract

The mixture of experts (MoE) model is a statistical machine learning design that aggregates multiple expert networks using a softmax gating function in order to form a more intricate and expressive model. Although MoE models are commonly used in several applications owing to their scalability, their mathematical and statistical properties are complex and difficult to analyze. As a result, previous theoretical works have primarily focused on probabilistic MoE models by imposing the impractical assumption that the data are generated from a Gaussian MoE model. In this work, we investigate the performance of least squares estimators (LSE) under a deterministic MoE model where the data are sampled according to a regression model, a setting that has remained largely unexplored. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. We demonstrate that the rates for estimating strongly identifiable experts, namely the widely used feed-forward networks with activation functions $\mathrm{sigmoid}(\cdot)$ and $\tanh(\cdot)$, are substantially faster than those of polynomial experts, which we show to exhibit a surprisingly slow estimation rate. Our findings have important practical implications for expert selection.
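
To make the setting in the abstract concrete, the following is a minimal sketch of a softmax gating mixture of sigmoid ridge experts together with the least squares objective. The parameterization (gating weights `beta1`, `beta0` and expert weights `a`, `b`) and all function names are assumptions chosen for illustration; they are not taken from the paper, whose exact model definition is not reproduced in this summary.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_regression(X, beta1, beta0, a, b):
    """Deterministic softmax-gated mixture of k sigmoid ridge experts.

    Assumed (hypothetical) shapes: X (n, d), beta1 (k, d), beta0 (k,),
    a (k, d), b (k,). Returns an array of n fitted values.
    """
    gates = softmax(X @ beta1.T + beta0)             # (n, k), rows sum to 1
    experts = 1.0 / (1.0 + np.exp(-(X @ a.T + b)))   # (n, k), sigmoid(a_i^T x + b_i)
    return (gates * experts).sum(axis=1)

def least_squares_loss(params, X, y):
    """Mean squared residual between responses y and the MoE regression function."""
    return np.mean((y - moe_regression(X, *params)) ** 2)
```

A least squares estimator in this sketch would be any minimizer of `least_squares_loss` over the parameters, for responses generated as the regression function at the true parameters plus noise. The optimizer is left unspecified here because the summary does not say how the estimator is computed.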

Paper Structure

This paper contains 19 sections, 9 theorems, 127 equations, 2 figures, 2 tables.

Key Result

Theorem 2.1

Given a least squares estimator $\widehat{G}_n$, as defined in the paper's equation labeled eq:least_squared_estimator, the model estimate $f_{\widehat{G}_n}$ admits the following convergence rate:

Figures (2)

  • Figure 1: Log-log scaled plots illustrating empirical convergence rates of parameter estimation in the softmax gating mixture of linear experts under the exact-specified and over-specified settings (one panel each). The blue curves depict the mean discrepancy between the least squares estimator $\widehat{G}_n$ and the true mixing measure $G_*$ under the loss $\mathcal{D}_{3,r}$, accompanied by error bars signifying two empirical standard deviations. Additionally, an orange dash-dotted line represents the least-squares fitted linear regression line for these data points.
  • Figure 2: Log-log scaled plots illustrating empirical convergence rates of parameter estimation in the softmax gating mixture of ridge experts with the sigmoid activation under the exact-specified and over-specified settings (one panel each). The blue curves depict the mean discrepancy between the least squares estimator $\widehat{G}_n$ and the true mixing measure $G_*$ under the loss $\mathcal{D}_{2}$, accompanied by error bars signifying two empirical standard deviations. Additionally, an orange dash-dotted line represents the least-squares fitted linear regression line for these data points.
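
The fitted regression lines described in the captions above can be read as empirical rate estimates: on a log-log scale, an error decaying like $C n^{-\alpha}$ becomes a straight line with slope $-\alpha$. The sketch below shows one way such a slope might be computed. The function name and the synthetic error values are hypothetical and are not the paper's data or results.

```python
import numpy as np

def empirical_rate(sample_sizes, errors):
    """Fit log(error) = slope * log(n) + intercept by ordinary least squares.

    Returns (slope, intercept); a slope of -alpha suggests an error
    decaying roughly like n^(-alpha).
    """
    log_n = np.log(np.asarray(sample_sizes, dtype=float))
    log_err = np.log(np.asarray(errors, dtype=float))
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(log_n, log_err, deg=1)
    return slope, intercept

# Synthetic illustration only: errors constructed to decay exactly like 2 * n^(-1/2).
ns = np.array([100, 1_000, 10_000, 100_000])
synthetic_errors = 2.0 * ns ** -0.5
slope, intercept = empirical_rate(ns, synthetic_errors)
```

With real simulation output, `errors` would be the mean discrepancy over repeated runs at each sample size, and the spread across runs would supply the error bars.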

Theorems & Definitions (12)

  • Theorem 2.1
  • Definition 3.1: Strong Identifiability
  • Theorem 3.2
  • Definition 4.1: Strong Independence
  • Theorem 4.2
  • Proposition 4.3
  • Theorem 4.4
  • Proposition 4.5
  • Theorem 4.6
  • Lemma 1.1
  • ...and 2 more