Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models

Siqi Wang, Zhengyu Chen, Bei Li, Keqing He, Min Zhang, Jingang Wang

TL;DR

This work investigates whether established Dense-model scaling laws extend to Mixture of Experts (MoE) models in large language models. It introduces a unified loss-scaling framework $\hat{L}(N,D,E) = \frac{A}{N^\alpha E^\gamma} + \frac{B}{D^\beta} + \sigma$ under a compute constraint $\text{FLOPs}(N,D) = C$, and derives architecture-dependent optimal resource allocations $D_{opt}(C)$ and $N_{opt}(C)$. The study shows that MoE models exhibit higher sensitivity to model scale than to data scale ($\alpha_N$ larger, $\alpha_D$ smaller), improve data efficiency by about 16.37% under the same compute budget, and have a lower gradient-noise scale when trained with Adam. Overall, the results indicate that the core principles governing scaling laws are preserved across Dense and MoE architectures, which allows Dense-model tuning practices to be transferred to MoE models and informs efficient training and deployment of scalable MoE-based LLMs.
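To make the loss formula and the compute-constrained allocation concrete, the sketch below evaluates $\hat{L}(N,D,E)$ and solves for the token count that minimizes it at a fixed budget. It assumes the constraint takes the simple form $N = C/D$ (the definition of model scale given in the Figure 1 caption); under that assumption, setting $\partial \hat{L}/\partial D = 0$ gives $D_{opt} = \left(\frac{\beta B E^\gamma}{\alpha A}\right)^{1/(\alpha+\beta)} C^{\alpha/(\alpha+\beta)}$. All coefficient values and function names are hypothetical placeholders, not the fitted values or the method from the paper.

```python
def predicted_loss(n, d, e, A, B, alpha, beta, gamma, sigma):
    """Loss curve L(N, D, E) = A / (N^alpha * E^gamma) + B / D^beta + sigma."""
    return A / (n ** alpha * e ** gamma) + B / d ** beta + sigma


def optimal_tokens(c, e, A, B, alpha, beta, gamma):
    """Closed-form minimizer over D, ASSUMING the constraint N = C / D.

    Substituting N = C / D and setting dL/dD = 0 gives
    D^(alpha + beta) = beta * B * C^alpha * E^gamma / (alpha * A).
    """
    prefactor = (beta * B * e ** gamma / (alpha * A)) ** (1.0 / (alpha + beta))
    return prefactor * c ** (alpha / (alpha + beta))


# Hypothetical coefficients chosen only for illustration.
A, B = 400.0, 600.0
ALPHA, BETA, GAMMA, SIGMA = 0.34, 0.28, 0.05, 1.7

budget = 1e18      # compute budget C (arbitrary units under the assumed constraint)
num_experts = 8    # E

d_opt = optimal_tokens(budget, num_experts, A, B, ALPHA, BETA, GAMMA)
n_opt = budget / d_opt  # model scale implied by the assumed constraint


def loss_at(d):
    """Predicted loss at the fixed budget when D = d and N = budget / d."""
    return predicted_loss(budget / d, d, num_experts,
                          A, B, ALPHA, BETA, GAMMA, SIGMA)


print(f"D_opt = {d_opt:.3e}, N_opt = {n_opt:.3e}, loss = {loss_at(d_opt):.4f}")
```

Under this assumed constraint, $D_{opt}$ grows as $C^{\alpha/(\alpha+\beta)}$ and $N_{opt}$ as $C^{\beta/(\alpha+\beta)}$, so the split between data and model scale depends only on the two exponents.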

Abstract

The scaling of large language models (LLMs) is a critical research area for the efficiency and effectiveness of model training and deployment. Our work investigates the transferability and discrepancies of scaling laws between Dense Models and Mixture of Experts (MoE) models. Through a combination of theoretical analysis and extensive experiments, including consistent loss scaling, optimal batch size and learning rate scaling, and resource allocation strategies scaling, our findings reveal that the power-law scaling framework also applies to MoE Models, indicating that the fundamental principles governing the scaling behavior of these models are preserved, even though the architecture differs. Additionally, MoE Models demonstrate superior generalization, resulting in lower testing losses with the same training compute budget compared to Dense Models. These findings indicate the scaling consistency and transfer generalization capabilities of MoE Models, providing new insights for optimizing MoE Model training and deployment strategies.

Paper Structure (17 sections, 27 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: The extrapolated scaling curves for 1.5B Mixture of Experts (MoE) models. This demonstrates that the proposed Loss Scaling Curve $\hat{L}(N,D,E) = \frac{A}{N^\alpha E^\gamma} + \frac{B}{D^\beta} + \sigma$ (for $E < 100$) fits MoE Models (eight experts) well. Specifically, $D$ is the number of tokens and $N$ is the model scale, defined as the compute budget ($C$) divided by $D$ rather than as the model size. $E$ is the number of experts and $\sigma$ represents the random noise scale of the dataset. $A$, $B$, $\gamma$, $\alpha$ and $\beta$ are all coefficients.
  • Figure 2: This diagram presents a heatmap of the distribution of training loss in relation to optimal batch size and training token quantities, with fitted curves representing different training loss values. A vertical red line connects the minimum points of each curve. (a) Training loss vs. optimal batch size when the MoE model size is 100M and the learning rate is 4e-3. (b) Training loss vs. optimal batch size when the MoE model size is 700M and the learning rate is 1e-3.
  • Figure 3: We plot the optimal batch size values together with the corresponding training loss values across different model sizes for both Dense Models and MoE Models. On log-scale axes, (a) shows the log-log relationship of training loss vs. optimal batch size for Dense Models and (b) shows the same relationship for MoE Models. This indicates that the power-law relationship remains consistent not only across model sizes but also across model architectures. The total overlap of the comparative performance interval is about 65.8%.
  • Figure 4: This diagram displays a heatmap showing the distribution of training loss in relation to optimal learning rates and training token quantities, with fitted curves representing different training loss values. A vertical red line connects the minimum points of each curve. (a) Training loss vs. optimal learning rate when the MoE model size is 700M and the batch size is 128 (the number of sequences of length 8192). (b) Training loss vs. optimal learning rate when the MoE model size is 1.5B, with the batch size being 128 (the number of sequences of length 8192).
  • Figure 5: We plot the optimal learning rate values together with the corresponding training loss values across different model sizes for both Dense Models and MoE Models. On log-scale axes, (a) shows the log-log relationship of training loss vs. optimal learning rate for Dense Models and (b) shows the same relationship for MoE Models. This indicates that the power-law relationship remains consistent not only across model sizes but also across model architectures. The total overlap of the comparative performance interval is about 76.2%.
  • ...and 2 more figures
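
The captions for Figures 3 and 5 describe power-law relationships that appear as straight lines on log-log axes. As a minimal sketch of how such a relationship can be fitted, the snippet below estimates $y = k \cdot x^{p}$ by ordinary least squares on $(\log x, \log y)$. The data points are synthetic and the helper name `fit_power_law` is hypothetical; neither the values nor the fitting procedure are taken from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = k * x**p by least squares on (log x, log y); returns (k, p)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent p.
    s_xx = sum((a - mean_x) ** 2 for a in log_x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    p = s_xy / s_xx
    # Intercept in log space is log(k).
    k = math.exp(mean_y - p * mean_x)
    return k, p


# Synthetic (training loss, optimal batch size) pairs generated from an
# exact power law with made-up parameters k = 50 and p = -4.
losses = [3.2, 2.9, 2.6, 2.4, 2.2]
batch_sizes = [50.0 * loss ** -4.0 for loss in losses]

k_fit, p_fit = fit_power_law(losses, batch_sizes)
print(f"k = {k_fit:.4f}, p = {p_fit:.4f}")
```

Fitting in log space weights relative rather than absolute errors, which suits quantities such as batch size and learning rate that span several orders of magnitude.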