$μ$-Parametrization for Mixture of Experts
Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jakub Krajewski
TL;DR
This work addresses the high cost of hyperparameter tuning for extremely wide mixture-of-experts (MoE) models by extending $\mu$-Parameterization to MoE. It gives a principled derivation that reparameterizes MoE components within the Tensor Programs V (TP5) framework, treating expert weights as hidden weights and the router as an output layer, which enables width-invariant feature learning and zero-shot learning-rate transfer. Empirically, the authors show reliable learning-rate transfer across model widths and substantial reductions in tuning cost; a simpler parameterization (simpleP) also enables transfer. The study also maps the limits of transfer across MoE scaling axes, notably finding that changes in granularity break transfer, which helps guide robust deployment of large MoE systems.
Abstract
Recent years have seen growing interest in and adoption of LLMs, with Mixture-of-Experts (MoE) emerging as a leading architecture for extremely large models. Currently, the largest open-source models exceed $1$T parameters. At such scales, hyperparameter tuning becomes prohibitively expensive. For this reason, $μ$Transfer is becoming a key technique: it allows optimal hyperparameters to be transferred across model scales, substantially reducing tuning costs. However, existing work has primarily focused on dense LLMs, leaving MoE architectures unexplored. In this work, we derive a $μ$-Parameterization for MoE, providing theoretical guarantees for feature learning across model widths. Our experiments demonstrate that the optimal learning rate reliably transfers across model sizes, establishing a foundation for efficient hyperparameter tuning in large-scale MoE models.
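
To make the idea of width-based transfer concrete, the following is a minimal sketch of how per-group initialization and learning rates might be rescaled when widening an MoE model. It is not the parameterization derived in this work. It assumes the standard $μ$P rules for Adam from the dense setting (hidden weights: initialization variance and learning rate both proportional to 1/fan-in; output weights: initialization variance proportional to 1/fan-in squared and learning rate proportional to 1/fan-in) and applies them to the grouping described in the TL;DR, with expert weights treated as hidden and the router treated as output. The function name, group names, and default values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GroupScaling:
    """Initialization std and Adam learning rate for one parameter group."""
    init_std: float
    lr: float


def mup_moe_scaling(d_model, base_d_model, base_lr, base_std=0.02):
    """Rescale hyperparameters tuned at base_d_model to a model of width d_model.

    Assumption: standard muP rules for Adam, with expert weights treated as
    hidden weights and the router treated as an output layer. The exact rules
    derived in the paper may differ.
    """
    m = d_model / base_d_model  # width multiplier relative to the tuned proxy
    return {
        # Input-like weights: init and learning rate stay constant in width.
        "embedding": GroupScaling(init_std=base_std, lr=base_lr),
        # Hidden-like weights (experts): variance ~ 1/m, learning rate ~ 1/m.
        "expert": GroupScaling(init_std=base_std / m ** 0.5, lr=base_lr / m),
        # Output-like weights (router): variance ~ 1/m^2, learning rate ~ 1/m.
        "router": GroupScaling(init_std=base_std / m, lr=base_lr / m),
    }


if __name__ == "__main__":
    # Learning rate tuned on a width-256 proxy, transferred to width 1024.
    for name, group in mup_moe_scaling(1024, 256, base_lr=1e-3).items():
        print(f"{name}: init_std={group.init_std:.5f}, lr={group.lr:.2e}")
```

In this sketch the learning rate is tuned once on the narrow proxy and the wider model reuses it through the scaling table, which is the sense in which the transfer is zero-shot.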
