Mixture of Parrots: Experts improve memorization more than reasoning

Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham M. Kakade, Eran Malach

TL;DR

As the number of experts increases (with the number of active parameters fixed), memorization performance consistently improves while reasoning capabilities saturate. On memory-intensive tasks, MoEs can effectively leverage a small number of active parameters with a large number of experts to memorize the data.

Abstract

The Mixture-of-Experts (MoE) architecture enables a significant increase in the total number of model parameters with minimal computational overhead. However, it is not clear what performance tradeoffs, if any, exist between MoEs and standard dense transformers. In this paper, we show that as we increase the number of experts (while fixing the number of active parameters), the memorization performance consistently increases while the reasoning capabilities saturate. We begin by analyzing the theoretical limitations of MoEs at reasoning. We prove that there exist graph problems that cannot be solved by any number of experts of a certain width; however, the same task can be easily solved by a dense model with a slightly larger width. On the other hand, we find that on memory-intensive tasks, MoEs can effectively leverage a small number of active parameters with a large number of experts to memorize the data. We empirically validate these findings on synthetic graph problems and memory-intensive closed book retrieval tasks. Lastly, we pre-train a series of MoEs and dense transformers and evaluate them on commonly used benchmarks in math and natural language. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
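To make the distinction between total and active parameters concrete, here is a minimal sketch that counts both for a single MoE feed-forward layer. The function names, the two-matrix feed-forward block, the linear router, and top-k routing are illustrative assumptions, not details taken from the paper.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in one two-matrix feed-forward block (biases ignored; an assumption)."""
    return 2 * d_model * d_ff


def moe_ffn_params(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return (total, active) weight counts for one hypothetical MoE layer.

    Assumes a linear router of shape d_model x num_experts and that each
    token is processed by exactly top_k experts.
    """
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts
    total = num_experts * per_expert + router   # stored in the model
    active = top_k * per_expert + router        # used for any single token
    return total, active


# Adding experts grows the total count, while the active count only
# changes through the (small) router term.
for n in (8, 32):
    print(n, moe_ffn_params(d_model=256, d_ff=1024, num_experts=n, top_k=2))
```

Under these assumptions, going from 8 to 32 experts roughly quadruples the total parameter count while the per-token active count stays almost unchanged, which is the regime the abstract describes.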

Paper Structure

This paper contains 46 sections, 17 theorems, 27 equations, 12 figures.

Key Result

Theorem 3.2

For some input sequence $G = (V, E)$, fix two disjoint subsets $A, B \subset [N-1]$, and consider a single-layer transformer $f \in \text{Transformer}_{m, H, 1, K}^N$ with $O(\log N)$-bit precision that solves length-2 path for any input $X$ where $X_A$ is a function of edges with the source $s$, $X

Figures (12)

  • Figure 1: (a) Evaluation: world knowledge. We train a series of dense transformers and MoEs on 65B tokens from a corpus consisting mainly of Fineweb-edu, Cosmopedia and Wikipedia (see the pre-training section for details). We then evaluate the models on several world knowledge benchmarks (e.g., TriviaQA [joshi2017triviaqa], Natural Questions [kwiatkowski2019natural]) and report the average F1 accuracy. Surprisingly, at a fixed number of total parameters, MoEs with substantially fewer active parameters approximately match the performance of dense models. This highlights the importance of experts in tasks that require memorization. (b) Evaluation: commonsense. Here we evaluate the same pre-trained models on natural language commonsense benchmarks (e.g., HellaSwag [zellers2019hellaswag], WinoGrande [sakaguchi2021winogrande]). On these reasoning tasks, MoEs perform worse than dense models, and larger benefits come from increasing the number of active parameters. (c) Evaluation: math. Here we train a series of dense transformers and MoEs on 65B tokens from a corpus consisting mainly of Proof-Pile2 [azerbayev2023llemma] (see the pre-training section for details). The results are consistent with those in (b): MoEs perform worse than dense models at an equal number of total parameters.
  • Figure 2: Illustration of the shortest path task. We feed the model with a sequence that lists all the edges in the input graph and ends with the query (in green) which asks the model to find a shortest path between two vertices (from vertex 1 to vertex 4 in the figure). The model then autoregressively returns the shortest path (in purple).
  • Figure 3: Illustration of the phone-book task for closed-book retrieval. The model is first trained to memorize a phone-book (illustrated on the right). Then, we randomly select a name in the phone-book (in green) and ask the model to return their phone number (in purple) without access to the phone-book.
  • Figure 4: (a) Phone-book memorization: We train a series of dense transformers and MoEs on phone-books of varying sizes and then evaluate their memorization capacity. We report the maximal phone-book size at which the model obtains more than 90% accuracy. The maximal phone-book size correlates with the total (and not the active) number of parameters. (b) Shortest path (total parameters): We train models to find the shortest path in 50-node graphs and report the test accuracy. Here, increasing the number of experts provides limited improvements, and performance instead correlates with the number of active parameters.
  • Figure 5: Generalization gap when the test set is GSM8k (a) and Hendrycks-MATH (b).
  • ...and 7 more figures
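
The shortest path task of Figure 2 can be sketched as a small data generator: the prompt lists the edges and ends with the query, and the target completion is a shortest path. The token format (`u-v` edges, `/` before the query, `=` before the answer), the function names, and the choice of undirected edges are assumptions made for illustration; the paper's actual serialization may differ.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search on an undirected graph; returns one shortest path or None."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    parent = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def build_example(edges, source, target):
    """Return (prompt, completion) strings in a hypothetical text format."""
    prompt = " ".join(f"{u}-{v}" for u, v in edges) + f" / {source} {target} ="
    path = shortest_path(edges, source, target)
    completion = " ".join(map(str, path)) if path is not None else "none"
    return prompt, completion


prompt, completion = build_example([(1, 2), (2, 3), (3, 4), (1, 3)], 1, 4)
print(prompt)      # 1-2 2-3 3-4 1-3 / 1 4 =
print(completion)  # 1 3 4
```

A model trained on such pairs would be asked to produce the completion autoregressively from the prompt, as the caption describes.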

Theorems & Definitions (32)

  • Definition 3.1: Length-2 Path Problem
  • Theorem 3.2: Length-2 path lower-bound on sparse transformers
  • Theorem 3.3: Length-2 path width upper bound for transformer
  • Corollary 3.4
  • proof
  • Theorem 3.5
  • Theorem 3.6: Lower bound for dense model
  • Definition D.1: Set-disjointness task
  • Lemma D.2: Equivalence of set-disjointness and length-2 path
  • proof
  • ...and 22 more
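
One way to see why the length-2 path problem relates to set-disjointness (Lemma D.2) is the following sketch: a length-2 path from `s` to `d` exists exactly when the set of vertices reachable from `s` in one step intersects the set of vertices with an edge into `d`. The function name and the directed edge-list representation are assumptions for illustration, and this is not the paper's proof.

```python
def has_length2_path(edges, s, d):
    """Check for a path s -> v -> d in a directed edge list via set intersection."""
    out_of_s = {v for u, v in edges if u == s}   # one step from s
    into_d = {u for u, v in edges if v == d}     # one step before d
    # A length-2 path exists iff the two sets are NOT disjoint.
    return not out_of_s.isdisjoint(into_d)


edges = [(0, 1), (1, 2), (0, 3)]
print(has_length2_path(edges, 0, 2))  # True: 0 -> 1 -> 2
print(has_length2_path(edges, 0, 3))  # False: only a direct edge 0 -> 3
```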