Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts

Youngseog Chung, Dhruv Malik, Jeff Schneider, Yuanzhi Li, Aarti Singh

TL;DR

Soft MoE with a single, arbitrarily powerful expert provably cannot represent simple convex functions. Soft MoE's success therefore cannot be explained by the traditional viewpoint of many small experts collectively mimicking the representation power of a single large expert.

Abstract

The traditional viewpoint on Sparse Mixture of Experts (MoE) models is that instead of training a single large expert, which is computationally expensive, we can train many small experts. The hope is that if the total parameter count of the small experts equals that of the single large expert, then we retain the representation power of the large expert while gaining computational tractability and promoting expert specialization. The recently introduced Soft MoE replaces the Sparse MoE's discrete routing mechanism with a differentiable gating function that smoothly mixes tokens. While this smooth gating function successfully mitigates the various training instabilities associated with Sparse MoE, it is unclear whether it induces implicit biases that affect Soft MoE's representation power or potential for expert specialization. We prove that Soft MoE with a single arbitrarily powerful expert cannot represent simple convex functions. This shows that Soft MoE's success cannot be explained by the traditional viewpoint of many small experts collectively mimicking the representation power of a single large expert, and that multiple experts are actually necessary to achieve good representation power (even for a fixed total parameter count). Continuing along this line of investigation, we introduce a notion of expert specialization for Soft MoE. Varying the number of experts while fixing the total parameter count, we consider the following (computationally intractable) task: given any input, how can we discover the expert subset that is specialized to predict this input's label? We empirically show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset. Our method can be easily implemented to potentially reduce computation during inference.

Paper Structure (23 sections, 1 theorem, 14 equations, 5 figures, 13 tables, 2 algorithms)


Key Result

Theorem 1

Fix any $m \geq 2$, $d \geq 1$ and $n = 1$. Define the target function $t: \mathbb{R}^{m \times d} \to \mathbb{R}$ as $t(X) = \| X \|_2$. Assume the existence of $\Phi \in \mathbb{R}^{d \times 1}$, $f: \mathbb{R}^{d} \to \mathbb{R}^{d}$ and $g: \mathbb{R}^{m \times d} \to \mathbb{R}$ such that the single-expert Soft MoE with gating parameters $\Phi$ and expert $f$, composed with $g$, equals $t$ on all of $\mathbb{R}^{m \times d}$. Then there are no $L_f, L_g \geq 0$ such that $f$ is $L_f$-Lipschitz and $g$ is $L_g$-Lipschitz.
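To make the single-expert setting of Theorem 1 concrete, the following is a minimal NumPy sketch of a Soft MoE layer. It assumes the commonly used Soft MoE formulation with one slot per expert (dispatch weights normalised over tokens, combine weights normalised over experts); this summary does not restate the layer's definition, so the function names `softmax` and `soft_moe` and the `tanh` placeholder expert are illustrative assumptions, not the paper's code.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def soft_moe(X, Phi, experts):
    """Soft MoE sketch (assumed formulation, one slot per expert).

    X:       (m, d) matrix of m tokens
    Phi:     (d, n) gating parameters, one column per expert
    experts: list of n callables mapping R^d -> R^d
    """
    logits = X @ Phi                  # (m, n)
    D = softmax(logits, axis=0)       # dispatch weights: normalised over tokens
    C = softmax(logits, axis=1)       # combine weights: normalised over experts
    slots = D.T @ X                   # (n, d): each slot is a convex mix of tokens
    outs = np.stack([f(s) for f, s in zip(experts, slots)])  # (n, d)
    return C @ outs                   # (m, d)


rng = np.random.default_rng(0)
m, d = 4, 3
X = rng.normal(size=(m, d))
Phi = rng.normal(size=(d, 1))         # n = 1, as in Theorem 1
f = np.tanh                           # hypothetical stand-in for the expert

Y = soft_moe(X, Phi, [f])
rows_identical = np.allclose(Y, Y[0])
```

Under this formulation, with $n = 1$ the combine weights are all equal to 1, so every output row is the same vector: $f$ applied to a single softmax-weighted average of the tokens. Any $g$ placed on top therefore sees the input only through that one $d$-dimensional summary, which gives some intuition for why the single expert is so constrained.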

Figures (5)

  • Figure 1: Our Algorithm \ref{alg:main} selects a specialized subset of the experts to utilize for inference. For any fixed proportion of the $n$ total experts selected, its performance improves uniformly as $n$ grows.
  • Figure 2: Results for MNIST, CIFAR10 and ImageNet-1k experiments in Section \ref{sec:specialization_main_experiment}. We depict the test accuracy as a function of $n$, for Algorithm \ref{alg:main} and Random selection, and for various choices of $k$. For the Random selection results, we report the mean over 10 random seeds. For CIFAR10, we report results only for $k=1$ and $k=2$, because $k > 2$ yielded accuracies nearly identical to using all experts. We hypothesize this is because CIFAR10 is relatively easy for the powerful Astroformer architecture.
  • Figure 3: Results for CIFAR100 experiments from Section \ref{sec:specialization_main_experiment}.
  • Figure 4: Loss curves from training Soft MoE models to learn the L2 norm function.
  • Figure 5: Each bar presents the number of unique subsets of experts of size $k=n/4$ that were used to produce the highest accuracy, for each setting of the total number of experts $n$.

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Theorem 1