Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing

Peizhuang Cong, Aomufei Yuan, Shimao Chen, Yuxuan Tian, Bowen Ye, Tong Yang

TL;DR

This work tackles the challenge of imbalanced expert loads in sparse MoE-based large language models during training. By tracing expert activations in GPT-3 scale MoEs, it identifies a transient state with pronounced load fluctuations and a subsequent stable state with temporal locality, then validates state-aware forecasting using three classical predictors: LSTM, ARIMA, and sliding window average (SW_Avg). On GPT-3 125M, the predictors forecast load proportions accurately for up to 1,000 iterations, with LSTM achieving sub-1% errors in the stable state and SW_Avg achieving about 0.25% error; on GPT-3 350M, ARIMA and SW_Avg maintain robust performance with stable-state errors around 1–1.4%, while LSTM remains effective though with higher errors in the transient state. These results offer practical guidance for dynamic expert placement and resource allocation during MoE training and motivate state-aware scheduling strategies for future work. Key quantitative results: about 1.3% and 1.8% average error over the next 1,000 and 2,000 steps on GPT-3 350M; sub-1% LSTM error in the stable state on GPT-3 125M; about 0.25% SW_Avg stable-state error on GPT-3 125M.

Abstract

MoE facilitates the development of large models because the model's computational complexity no longer scales linearly with its parameter count. The learned sparse gating network selects a set of experts for each token to be processed; however, this may lead to differences in the number of tokens processed by each expert over several successive iterations, i.e., expert load fluctuations, which reduce computational parallelization and resource utilization. To this end, we traced and analyzed the load of each expert across the training iterations of several large language models, and defined the transient state with "obvious load fluctuation" and the stable state with "temporal locality". Moreover, given the characteristics of these two states and the computational overhead, we deployed three classical prediction algorithms that achieve accurate expert load prediction. For the GPT-3 350M model, the average error rates for predicting the expert load proportion over the next 1,000 and 2,000 steps are approximately 1.3% and 1.8%, respectively. This work can provide valuable guidance for expert placement or resource allocation in MoE model training. Based on this work, we will propose an expert placement scheme for the transient and stable states in future work.
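
To make the idea of predicting expert load proportions concrete, the following is a minimal sketch of a sliding window average predictor, the simplest of the three algorithm families named above. It is not the authors' implementation: the function names, the window length, the toy token counts, and the use of mean absolute difference as the error measure are all assumptions made for illustration.

```python
from typing import List, Sequence


def load_proportions(token_counts: Sequence[int]) -> List[float]:
    """Convert per-expert token counts for one iteration into load proportions."""
    total = sum(token_counts)
    if total == 0:
        raise ValueError("no tokens were routed in this iteration")
    return [count / total for count in token_counts]


def sliding_window_avg(history: Sequence[Sequence[float]], window: int) -> List[float]:
    """Predict each expert's future load proportion as its mean over the
    last `window` recorded iterations (hypothetical helper, not from the paper)."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not history:
        raise ValueError("history must contain at least one iteration")
    recent = history[-window:]
    num_experts = len(recent[0])
    return [sum(step[e] for step in recent) / len(recent) for e in range(num_experts)]


def mean_abs_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute difference between predicted and observed proportions.
    Assumed error measure; the paper's exact metric is not stated here."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Toy trace: token counts for 4 experts over 3 iterations (made-up numbers).
counts = [[40, 30, 20, 10], [38, 32, 20, 10], [42, 28, 20, 10]]
history = [load_proportions(c) for c in counts]
prediction = sliding_window_avg(history, window=2)
observed = load_proportions([41, 29, 20, 10])
print(prediction, mean_abs_error(prediction, observed))
```

A predictor of this kind relies on the temporal locality of the stable state: when recent proportions change little, their average is a reasonable forecast. In the transient state, where loads fluctuate, a short window or a learned model may be more appropriate.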

Paper Structure

This paper contains 16 sections, 11 figures, and 1 table.

Figures (11)

  • Figure 1: Load proportions of experts in the MoE layer
  • Figure 2: Variance values of experts load proportion of GPT-3 125M (w=10 and w=100)
  • Figure 3: Variance values of experts load proportion of GPT-3 125M (w=100)
  • Figure 4: Range values of experts load proportion of GPT-3 125M (w=100)
  • Figure 5: Prediction accuracy for GPT-3 125M
  • ...and 6 more figures
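
The variance and range statistics named in the captions of Figures 2 to 4 can be sketched as follows. This is an illustrative reading, assuming that w denotes the number of consecutive training iterations in a sliding window and that the statistics are computed per expert over that expert's load proportions; the function names and the toy series are hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence


def windowed_variance(series: Sequence[float], w: int) -> List[float]:
    """Population variance of one expert's load proportion over each
    window of w consecutive iterations (assumed meaning of w)."""
    if w <= 0:
        raise ValueError("w must be positive")
    return [pvariance(series[i - w + 1 : i + 1]) for i in range(w - 1, len(series))]


def windowed_range(series: Sequence[float], w: int) -> List[float]:
    """Max minus min of one expert's load proportion over each window."""
    if w <= 0:
        raise ValueError("w must be positive")
    windows = (series[i - w + 1 : i + 1] for i in range(w - 1, len(series)))
    return [max(win) - min(win) for win in windows]


# Toy series: one expert's load proportion over 6 iterations (made-up numbers).
# The first half fluctuates; the second half is nearly flat.
expert_load = [0.10, 0.30, 0.10, 0.20, 0.20, 0.20]
print(windowed_variance(expert_load, w=3))
print(windowed_range(expert_load, w=3))
```

Under this reading, large windowed variance or range would indicate the fluctuating transient state, while values close to zero would indicate the stable state.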