MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

Nurbek Tastan, Stefanos Laskaridis, Karthik Nandakumar, Samuel Horvath

TL;DR

MoSE addresses the abrupt accuracy-cost trade-offs of sparse MoE models by introducing slimmable experts, which allow the width of each activated expert to be adapted at inference time. It presents a simple, stable pre-training scheme and supports three inference modes, including a lightweight test-time training (TTT) mechanism that learns a mapping from router sharpness (confidence) to expert width under a fixed compute budget, all without retraining the pretrained model. Across GPT2-Small to GPT2-Medium models trained on OpenWebText, MoSE consistently matches or exceeds standard MoE performance and improves the accuracy-compute Pareto frontier, especially with TTT-based width identification. This offers a practical path to flexible, compute-aware deployment of large language models that reduces FLOPs while preserving accuracy.

Abstract

Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated, but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT models trained on OpenWebText demonstrate that MoSE matches or improves upon standard MoE at full width and consistently shifts the Pareto frontier for accuracy vs. cost, achieving comparable performance with significantly fewer FLOPs.
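
To make the idea of a nested, slimmable expert concrete, the following is a minimal sketch of an expert FFN whose hidden dimension is sliced according to a width $w$. It is not the authors' implementation: the class name `SlimmableExpert`, the bias-free two-layer layout, the GELU activation, the weight initialization, and the rounding rule for the number of active units are all assumptions made for illustration.

```python
import math
import random


def gelu(x: float) -> float:
    """Tanh approximation of GELU (assumed activation, as in GPT-2 FFNs)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


class SlimmableExpert:
    """Hypothetical bias-free two-layer FFN expert with sliceable hidden units."""

    def __init__(self, d_model: int, d_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.d_model = d_model
        self.d_hidden = d_hidden
        # w_in[j]: input weights of hidden unit j (length d_model).
        self.w_in = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(d_hidden)]
        # w_out[j]: output weights of hidden unit j (length d_model).
        self.w_out = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(d_hidden)]

    def active_units(self, width: float) -> int:
        """Number of leading hidden units kept for a width in (0, 1]."""
        if not 0.0 < width <= 1.0:
            raise ValueError("width must be in (0, 1]")
        return max(1, round(width * self.d_hidden))

    def forward(self, x: list[float], width: float = 1.0) -> list[float]:
        k = self.active_units(width)
        out = [0.0] * self.d_model
        # Nested structure: only the first k hidden units are evaluated,
        # so a narrower width reuses a prefix of the full expert's weights.
        for j in range(k):
            h = gelu(sum(w * xi for w, xi in zip(self.w_in[j], x)))
            for i in range(self.d_model):
                out[i] += h * self.w_out[j][i]
        return out


if __name__ == "__main__":
    expert = SlimmableExpert(d_model=4, d_hidden=8)
    token = [1.0, -0.5, 0.25, 2.0]
    print(expert.active_units(0.5))       # hidden units used at w = 0.5
    print(expert.forward(token, width=0.5))
```

Because the narrow expert is a prefix of the wide one, a single set of weights serves every width, which is what allows one pretrained model to cover a range of compute budgets.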

Paper Structure (35 sections, 12 equations, 13 figures, 2 tables)

Figures (13)

  • Figure 1: Comparison between standard MoE and MoSE. In the standard case, the router selects a fixed number of full-width experts and activates all their parameters. In the proposed method, the router not only selects multiple experts but also adjusts their widths, allowing more experts to contribute under the same parameter budget. This increases expert diversity without increasing total compute cost, potentially improving model accuracy at the same efficiency.
  • Figure 2: Slimmable expert in MoSE. Example with $w=0.5$, where only half of the intermediate units of the expert FFN are activated by slicing the hidden dimension.
  • Figure 3: Pre-training dynamics of MoE and MoSE on the OpenWebText dataset using the GPT2-Small model.
  • Figure 4: Compute-quality trade-offs across GPT2-Small, GPT2-Standard, and GPT2-Medium models under the $\mathrm{E8A2}$ setting. MoSE with test-time training learns compute-aware width identification that shifts the Pareto frontier, achieving lower perplexity than the uniform-width mode at comparable MFLOPs per token.
  • Figure 5: Scaling pre-training tokens in the $\mathrm{E8A2}$ setting. We scale the number of pre-training tokens from 3B (Figure 4) to 15B for GPT2-Small and GPT2-Standard, while keeping the same routing setup and compute budget. Our test-time training for MoSE width identification continues to dominate across the Pareto frontier, and increasing the amount of pre-training data consistently shifts the quality-compute trade-off downward. The relative advantage of test-time training (with both layer-wise and shared parameters) remains stable under increased data scale.
  • ...and 8 more figures
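
The budget argument in the Figure 1 caption (more experts at reduced width for the same cost) can be checked with a rough FLOP count. The sketch below is an illustration only: it assumes a bias-free two-layer expert FFN whose cost scales linearly with the number of active hidden units, counts 2 FLOPs per multiply-add, ignores router cost, and uses the GPT-2 small dimensions (768 and 3072) as an assumed expert size. The function names are hypothetical.

```python
def expert_flops(d_model: int, d_hidden: int, width: float) -> int:
    """Approximate per-token FLOPs of one expert FFN run at the given width."""
    active = max(1, round(width * d_hidden))
    # Two matmuls (d_model -> active, active -> d_model), 2 FLOPs per multiply-add.
    return 2 * 2 * d_model * active


def layer_flops(d_model: int, d_hidden: int, widths: list[float]) -> int:
    """Total per-token expert FLOPs when one expert is activated per entry in widths."""
    return sum(expert_flops(d_model, d_hidden, w) for w in widths)


if __name__ == "__main__":
    d_model, d_hidden = 768, 3072  # assumed expert size (GPT-2 small FFN dimensions)
    two_full = layer_flops(d_model, d_hidden, [1.0, 1.0])
    four_half = layer_flops(d_model, d_hidden, [0.5, 0.5, 0.5, 0.5])
    print(two_full == four_half)  # same budget, twice as many contributing experts
```

Under these assumptions, expert cost is linear in width, so any set of widths with the same sum has the same expert FLOPs per token.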