Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota

TL;DR

The authors argue that optimal MoE sparsity must be determined jointly by active FLOPs and tokens per parameter (TPP), revising the classical picture of compute-optimal scaling.

Abstract

Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-$k$ routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.
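To make the two quantities named in the abstract concrete, here is a minimal Python sketch. It assumes that TPP means training tokens divided by total parameter count, and it uses the common rule of thumb of roughly 6 FLOPs per active parameter per training token; neither the helper names nor the example numbers come from the paper.

```python
def tokens_per_parameter(train_tokens: float, total_params: float) -> float:
    """Training tokens divided by total (not active) parameters.

    Assumption: this is the TPP definition intended by the abstract.
    """
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return train_tokens / total_params


def approx_training_flops(active_params: float, train_tokens: float) -> float:
    """Rough training compute, using the common ~6 * N_active * D heuristic.

    Assumption: only the parameters active per token contribute to compute.
    """
    if active_params <= 0 or train_tokens < 0:
        raise ValueError("active_params must be positive and train_tokens non-negative")
    return 6.0 * active_params * train_tokens


if __name__ == "__main__":
    # Hypothetical model: 1e9 active parameters, 8e9 total parameters, 2e10 tokens.
    tokens = 2e10
    print("TPP:", tokens_per_parameter(tokens, total_params=8e9))
    print("approx. training FLOPs:", approx_training_flops(1e9, tokens))
```

Under these assumptions, adding experts at a fixed token budget leaves the approximate training FLOPs unchanged but lowers TPP, which is the trade-off the abstract points to.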

Paper Structure

This paper contains 48 sections, 25 figures, 4 tables.

Figures (25)

  • Figure 1: Although training and validation loss decrease as the total number of parameters grows, the task loss on GSM8K can sometimes worsen with larger models. Training and validation losses steadily decrease as total or active parameters increase. The HellaSwag task loss follows this scaling trend, whereas GSM8K task loss worsens once total parameters exceed a threshold. Within each fixed top-$k$ group, moving right on the x-axis corresponds to increasing sparsity (because the total number of experts $E$ increases while $k$ remains fixed), so the right-hand task-loss panels implicitly reflect the same sparsity ordering shown explicitly in the sparsity-versus-accuracy figure.
  • Figure 2: For GSM8K and GSM-Plus, once the training loss drops below a certain point, the task loss starts to increase. Results of scaling total parameters by increasing the number of experts, with model width and top-$k$ held constant. For TriviaQA and HellaSwag, the task loss falls monotonically as training loss decreases. By contrast, GSM8K and GSM-Plus show a U-shaped trend: task loss declines with training loss only until a threshold, beyond which further reductions in training loss hurt task performance. That threshold moves lower as the active parameter count increases, and models with more active parameters achieve a lower optimal task loss. No such dependence on active parameters appears for TriviaQA or HellaSwag.
  • Figure 3: Downstream accuracy when scaling total parameters via expert count with width and top-$k$ fixed. TriviaQA and HellaSwag exhibit steadily improving accuracy as pre-training loss decreases, whereas GSM8K shows a non-monotonic trend: further reductions in pre-training loss do not always improve accuracy and can even degrade performance.
  • Figure 4: Effect of sparsity on performance across different tasks. We vary sparsity ($1 - k/E$, where $k$ is the top-$k$ routing value and $E$ is the number of experts) and plot the relationship between pre-training loss and benchmark error rate, including intermediate checkpoints. For TriviaQA and HellaSwag, the error rate clearly tracks training loss and is largely insensitive to sparsity. In contrast, for the reasoning tasks, the error rate depends strongly on sparsity.
  • Figure 5: At fixed active parameter counts, higher sparsity (lower density) consistently improves performance, but at larger active parameter counts, GSM8K and GSM-Plus shift their optima back toward dense models. Task loss (top row) and accuracy (bottom row) against the ratio of active experts $k$ to total experts $E$ for a fixed active parameter budget. In the left two tasks (TriviaQA, HellaSwag), increasing sparsity consistently lowers task loss and raises accuracy across all active parameter budgets. In contrast, in the right two tasks (GSM8K, GSM-Plus), once active parameter counts become large, this trend reverses and denser models begin to outperform their sparser counterparts. Dashed segments mark the inverse-scaling regime that starts at the black circle; solid segments show the standard scaling region to the right.
  • ...and 20 more figures
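
The captions above describe sparsity as 1 - top-$k$/Experts and density as the ratio of active experts $k$ to total experts $E$. The short Python sketch below illustrates how sparsity rises when $E$ grows with $k$ fixed, as in the Figure 1 caption. The function names and the expert counts are illustrative assumptions, not values taken from the paper.

```python
def density(top_k: int, num_experts: int) -> float:
    """Fraction of experts active per token, k / E."""
    if not 1 <= top_k <= num_experts:
        raise ValueError("require 1 <= top_k <= num_experts")
    return top_k / num_experts


def sparsity(top_k: int, num_experts: int) -> float:
    """Sparsity as described in the Figure 4 caption: 1 - k / E."""
    return 1.0 - density(top_k, num_experts)


if __name__ == "__main__":
    # Hypothetical sweep: top-k fixed at 2, expert count doubling.
    for num_experts in (8, 16, 32, 64):
        s = sparsity(2, num_experts)
        print(f"k=2, E={num_experts}: sparsity={s:.5f}")
```

A dense model corresponds to $k = E$, which gives density 1 and sparsity 0 under this definition.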