MemFine: Memory-Aware Fine-Grained Scheduling for MoE Training

Lu Zhao, Rong Shi, Yue Sun, Shaoqing Zhang, Hongxing Niu, Yueqiang Chen, Baoguo He, Hongfeng Sun, Ziqing Yin, Shangchao Su, Zhiyan Cui, Liang Dong, Xiyuan Li, Lingbin Wang, Jianwei He, Jiesong Ma, Weikang Huang, Jianglei Tong, Dongdong Gao, Jian Zhang, Hong Tian, Hui Shen, Zongtai Luo, Zhaoqun Sun, Hongxing Niu, Yue Sun

TL;DR

This work tackles the memory bottleneck in large-scale MoE training caused by dynamic token routing and severe load imbalance on memory-constrained GPUs. It introduces MemFine, a memory-aware fine-grained scheduling framework that partitions token distribution and expert computation into chunks and applies chunked recomputation, guided by a theoretical memory cost model. The key contributions are the Fine-grained Chunk Distribution Algorithm (FCDA) and Memory-Aware Chunk Tuning (MACT), which together reduce activation memory and improve end-to-end throughput without altering routing. Empirical results show a 48.03% reduction in activation memory and a 4.42% throughput gain over full recomputation baselines, enabling stable training of ultra-large MoE models on GPUs with limited memory and expanding practical scalability.

Abstract

The training of large-scale Mixture of Experts (MoE) models faces a critical memory bottleneck due to severe load imbalance caused by dynamic token routing. This imbalance leads to memory overflow on GPUs with limited capacity, constraining model scalability. Existing load balancing methods, which cap expert capacity, compromise model accuracy and fail on memory-constrained hardware. To address this, we propose MemFine, a memory-aware fine-grained scheduling framework for MoE training. MemFine decomposes the token distribution and expert computation into manageable chunks and employs a chunked recomputation strategy, dynamically optimized through a theoretical memory model to balance memory efficiency and throughput. Experiments demonstrate that MemFine reduces activation memory by 48.03% and improves throughput by 4.42% compared to full recomputation-based baselines, enabling stable large-scale MoE training on memory-limited GPUs.
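The chunking idea in the abstract can be illustrated with a toy sketch. It is not the paper's FCDA or MACT algorithm: the helper names `split_into_chunks` and `run_expert_chunked`, the near-equal split rule, and the use of "tokens in the largest chunk" as a stand-in for peak activation memory are all assumptions made for illustration. Recomputation is left out entirely. The sketch only shows that processing an expert's tokens chunk by chunk gives the same output while bounding how many tokens are live at once.

```python
from typing import Callable, List, Sequence, Tuple


def split_into_chunks(num_tokens: int, num_chunks: int) -> List[range]:
    """Split indices [0, num_tokens) into num_chunks contiguous, near-equal ranges.

    Hypothetical helper: the even split is an assumption, not the paper's rule.
    """
    if num_tokens < 0 or num_chunks < 1:
        raise ValueError("num_tokens must be >= 0 and num_chunks >= 1")
    base, extra = divmod(num_tokens, num_chunks)
    ranges, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more token
        ranges.append(range(start, start + size))
        start += size
    return ranges


def run_expert_chunked(
    tokens: Sequence[float],
    expert: Callable[[List[float]], List[float]],
    num_chunks: int,
) -> Tuple[List[float], int]:
    """Apply `expert` one chunk at a time.

    Returns the concatenated outputs and the largest chunk size seen, which
    serves here as a crude proxy for peak activation memory (an assumption).
    """
    outputs: List[float] = []
    peak_tokens = 0
    for idx in split_into_chunks(len(tokens), num_chunks):
        chunk = [tokens[i] for i in idx]
        peak_tokens = max(peak_tokens, len(chunk))
        outputs.extend(expert(chunk))  # only this chunk's intermediates are live
    return outputs, peak_tokens


if __name__ == "__main__":
    tokens = [float(t) for t in range(10)]
    toy_expert = lambda chunk: [2.0 * t for t in chunk]  # stand-in for an expert FFN
    full, peak_full = run_expert_chunked(tokens, toy_expert, num_chunks=1)
    chunked, peak_chunked = run_expert_chunked(tokens, toy_expert, num_chunks=4)
    print(full == chunked, peak_full, peak_chunked)  # True 10 3
```

In this toy setting, more chunks lower the peak number of live tokens but add loop iterations, which loosely mirrors the memory-versus-throughput trade-off the abstract says the chunk tuning balances.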

Paper Structure

This paper contains 12 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: General Architecture of MoE Models. The blue dots denote the stored activation.
  • Figure 2: Number of tokens received per MoE layer, using the 7th iteration as an example.
  • Figure 3: The workflow of MemFine.
  • Figure 4: Throughput comparison of three methods.
  • Figure 5: Trend of chunk values during training of Model I with Method 3.