Turn Waste into Worth: Rectifying Top-$k$ Router of MoE

Zhiyuan Zeng, Qipeng Guo, Zhaoye Fei, Zhangyue Yin, Yunhua Zhou, Linyang Li, Tianxiang Sun, Hang Yan, Dahua Lin, Xipeng Qiu

TL;DR

The Rectify-Router, comprising the Intra-GPU Rectification and the Fill-in Rectification, effectively handles dropped tokens and padding, respectively, and achieves superior performance, surpassing the accuracy of the vanilla top-1 router by 4.7%.

Abstract

Sparse Mixture of Experts (MoE) models are popular for training large language models due to their computational efficiency. However, the commonly used top-$k$ routing mechanism suffers from redundant computation and memory costs caused by unbalanced routing. Some experts overflow, and the excess tokens are dropped, while other experts are vacant and are padded with zeros, negatively impacting model performance. To address dropped tokens and padding, we propose the Rectify-Router, comprising the Intra-GPU Rectification and the Fill-in Rectification. The Intra-GPU Rectification handles dropped tokens by efficiently routing them to experts within the GPU where they are located, which avoids inter-GPU communication. The Fill-in Rectification addresses padding by replacing padding tokens with tokens that have high routing scores. Our experimental results demonstrate that the Intra-GPU Rectification and the Fill-in Rectification effectively handle dropped tokens and padding, respectively. Furthermore, their combination achieves superior performance, surpassing the accuracy of the vanilla top-1 router by 4.7%.
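
To make the terms in the abstract concrete, the following is a minimal sketch in plain Python of top-1 routing with a fixed expert capacity, followed by toy versions of the two rectifications. It is an illustration under stated assumptions, not the paper's implementation: the function names (`top1_route`, `intra_gpu_rectify`, `fill_in_rectify`), the score matrix, and the GPU layout are all hypothetical. It further assumes that a dropped token is sent to its highest-scoring expert on its own GPU with no capacity limit, and that empty slots are filled from all tokens not already queued for that expert.

```python
def top1_route(scores, capacity):
    """Send each token to its highest-scoring expert, up to a fixed capacity.

    scores[t][e] is the routing score of token t for expert e.
    Returns (queues, dropped, padding):
      queues[e]  -> tokens accepted by expert e, in arrival order
      dropped    -> (token, expert) pairs that overflowed the expert's queue
      padding[e] -> number of empty (zero-padded) slots of expert e
    """
    num_experts = len(scores[0])
    queues = [[] for _ in range(num_experts)]
    dropped = []
    for t, row in enumerate(scores):
        e = max(range(num_experts), key=lambda j: row[j])
        if len(queues[e]) < capacity:
            queues[e].append(t)
        else:
            dropped.append((t, e))
    padding = [capacity - len(q) for q in queues]
    return queues, dropped, padding


def intra_gpu_rectify(scores, dropped, expert_gpu, token_gpu):
    """Assumed variant: reroute each dropped token to the best expert on its own GPU."""
    rerouted = {}
    for t, _ in dropped:
        local = [e for e, g in enumerate(expert_gpu) if g == token_gpu[t]]
        rerouted[t] = max(local, key=lambda e: scores[t][e])
    return rerouted


def fill_in_rectify(scores, queues, capacity):
    """Assumed variant: fill empty slots with the highest-scoring tokens not yet queued."""
    filled = []
    for e, q in enumerate(queues):
        free = capacity - len(q)
        candidates = [t for t in range(len(scores)) if t not in q]
        candidates.sort(key=lambda t: scores[t][e], reverse=True)
        filled.append(q + candidates[:free])
    return filled


# Hypothetical setup: 6 tokens, 4 experts, capacity 2 per expert.
scores = [
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.50, 0.30, 0.10, 0.10],
    [0.15, 0.15, 0.60, 0.10],
    [0.40, 0.10, 0.30, 0.20],
    [0.10, 0.10, 0.20, 0.60],
]
expert_gpu = [0, 0, 1, 1]        # experts 0-1 live on GPU 0, experts 2-3 on GPU 1
token_gpu = [0, 0, 0, 1, 1, 1]   # tokens 0-2 reside on GPU 0, tokens 3-5 on GPU 1

queues, dropped, padding = top1_route(scores, capacity=2)
rerouted = intra_gpu_rectify(scores, dropped, expert_gpu, token_gpu)
filled = fill_in_rectify(scores, queues, capacity=2)
print(queues, dropped, padding, rerouted, filled)
```

In this toy run expert 0 overflows, so tokens 2 and 4 are dropped, while experts 1, 2 and 3 are left with empty slots. The sketch reroutes token 2 to expert 0 and token 4 to expert 2, each on the token's own GPU, and fills the empty slots with the tokens that score highest for the underfilled experts.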

Paper Structure (32 sections, 6 equations, 3 figures, 9 tables)

Figures (3)

  • Figure 1: Illustration of dropped tokens and padding in the top-$k$ router of MoE. Queue $i$ represents the queue of tokens to be sent to expert $i$. The capacity of each expert is fixed at 3.
  • Figure 2: Left: Post-processing of dropped tokens at GPU 0 with Intra-GPU Rectification. Right: Post-processing of padding at GPU 0 with Fill-in Rectification.
  • Figure 3: The performance of 8-expert and 32-expert MoEs on MMLU, SuperGLUE, TruthfulQA and LogiQA.