$\mu$-Parametrization for Mixture of Experts

Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jakub Krajewski

TL;DR

This work addresses the high cost of hyperparameter tuning for extremely wide Mixture-of-Experts (MoE) models by extending $\mu$-Parameterization to MoE. It derives a principled reparameterization of MoE components within the Tensor Programs V (TP5) framework, treating expert weights as hidden weights and the router as an output weight, which enables width-invariant feature learning and zero-shot learning-rate transfer. Empirically, the authors show reliable learning-rate transfer across model widths and substantial reductions in tuning cost; a simpler parameterization (simpleP) also enables transfer. The study also maps the limits of transfer across MoE scaling axes: the optimal learning rate is preserved when the number of experts changes, but changing granularity breaks transfer.

Abstract

Recent years have seen growing interest in and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture for extremely large models. Currently, the largest open-source models exceed $1$T parameters. At such scales, hyperparameter tuning becomes prohibitively expensive. For precisely this reason, $\mu$Transfer is becoming a key technique: it allows optimal hyperparameters to be transferred across model scales, greatly reducing tuning costs. However, existing work has focused primarily on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $\mu$-Parameterization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in large-scale MoE models.
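To make the idea of zero-shot learning-rate transfer concrete, the following is a minimal sketch of how per-group hyperparameters might be rescaled from a small tuned model to a wider one. It assumes the commonly cited Tensor Programs V rules for Adam (hidden weights: initialization standard deviation scales as $1/\sqrt{m}$ and learning rate as $1/m$; output weights: both scale as $1/m$, where $m$ is the width multiplier) and maps expert weights to the hidden rule and the router to the output rule, as the summary describes. The function name `mup_moe_hparams`, the `GroupHParams` container, and the exact multipliers are illustrative assumptions, not the paper's stated formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupHParams:
    init_std: float       # standard deviation used to initialize the weights
    learning_rate: float  # per-group Adam learning rate


def mup_moe_hparams(width, base_width, base_lr, base_std):
    """Rescale hyperparameters tuned at base_width to a target width.

    Assumed scaling rules (Adam, Tensor Programs V style), m = width / base_width:
      hidden-type weights: init_std ~ 1/sqrt(m), learning_rate ~ 1/m
      output-type weights: init_std ~ 1/m,       learning_rate ~ 1/m
    These multipliers are an assumption and may differ from the paper's.
    """
    if width <= 0 or base_width <= 0:
        raise ValueError("widths must be positive")
    m = width / base_width  # width multiplier relative to the tuned base model
    hidden = GroupHParams(init_std=base_std / m ** 0.5, learning_rate=base_lr / m)
    output = GroupHParams(init_std=base_std / m, learning_rate=base_lr / m)
    return {
        "expert_weights": hidden,  # experts treated as hidden weights
        "router": output,          # router treated as an output weight
    }


if __name__ == "__main__":
    # Hypothetical values: tune at width 256, then reuse at width 1024.
    for name, hp in mup_moe_hparams(1024, 256, base_lr=1e-2, base_std=0.02).items():
        print(name, hp)
```

Under these assumptions, the learning rate tuned once at the base width is reused at every larger width through the multiplier alone, so no new sweep is needed for the wider model.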

Paper Structure

This paper contains 20 sections, 1 theorem, 13 equations, 4 figures, 2 tables.

Key Result

Lemma B.1

Let an $L$-block Switch-MoE be $\mu$P-parametrized with width $n\to\infty$, a fixed number of experts $n_{\mathrm{experts}}=O(1)$, and fixed top-$k=O(1)$. For each block $\ell$, define [...]. Assume the inductive hypothesis [...]. Then for every block $\ell$ and any active expert $e$, at both [...] we have [...].

Figures (4)

  • Figure 1: Learning rate transfer in MoE. Left: Standard Parameterization (SP). Middle: Our simpleP, treating each expert as a feed-forward layer. Right: Our $\mu$P for MoE with both router and expert reparametrization. Under SP, the optimal learning rate depends strongly on width, while both reparameterizations achieve transfer across widths.
  • Figure 2: (a) Varying the number of experts. Under our $\mu$P for MoE, the optimal learning rate is preserved as the number of experts varies. (b) Varying granularity. The optimal learning rate is not preserved across different granularities.
  • Figure 3: Learning rate transfer in dense models. SP (left) has a different optimal learning rate for each model width, while $\mu$P has a mostly stable optimum (with a slight upward shift, as in the TP5 and MoE sweeps).
  • Figure 4: MoE performance across learning rates in three setups. Left: standard parameterization (SP) with no scaling. Middle: simpleP, treating each expert like a feed-forward layer. Right: $\mu$P, our theory applied to the MoE layer. Under SP, the optimal learning rate varies with model size, while both reparameterizations achieve learning rate transfer across model widths.

Theorems & Definitions (2)

  • Lemma B.1: Expert–gradient covariance
  • proof