Sparse Checkpointing for Fast and Reliable MoE Training

Swapnil Gandhi, Christos Kozyrakis

TL;DR

Evaluations show that MoEtion reduces checkpointing overhead by up to \(4\times\) and recovery overhead by up to \(31\times\) compared to state-of-the-art approaches, sustaining consistently high Effective Training Time Ratios (ETTR) and delivering up to $8\times$ overall training speedup, all without compromising synchronous training semantics.

Abstract

As large language models scale, training them requires thousands of GPUs over extended durations, making frequent failures inevitable. While checkpointing remains the primary fault-tolerance mechanism, existing methods fall short when applied to Mixture-of-Experts (MoE) models. Because their training state is substantially larger, MoE models exacerbate checkpointing overheads, often causing costly stalls or prolonged recovery that severely degrade training efficiency. We present MoEtion, a distributed, in-memory checkpointing system tailored for MoE models. MoEtion is built on three key ideas: (1) sparse checkpointing, which incrementally snapshots subsets of experts across iterations to reduce overhead; (2) a sparse-to-dense checkpoint conversion mechanism that incrementally reconstructs consistent dense checkpoints from sparse snapshots; and (3) upstream logging of activations and gradients at pipeline-stage boundaries, enabling localized recovery without re-executing unaffected workers. Evaluations across diverse MoE models with up to 64 experts show that MoEtion reduces checkpointing overhead by up to \(4\times\) and recovery overhead by up to \(31\times\) compared to state-of-the-art approaches, sustaining consistently high Effective Training Time Ratios (ETTR) of $\ge 0.94$ even under frequent failures (MTBF as low as 10 minutes) and delivering up to $8\times$ overall training speedup, all without compromising synchronous training semantics. Overall, MoEtion offers a robust and scalable fault-tolerance solution for the next generation of sparsely activated models.
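
To make idea (1) concrete, the following is a minimal sketch of a sparse snapshot schedule. It is an illustration under stated assumptions, not MoEtion's actual scheduler: the function names `sparse_schedule` and `window_length`, the parameter `experts_per_iter`, and the round-robin ordering are all hypothetical. The sketch shows only the basic property that snapshotting a fixed-size subset of experts per iteration covers every expert once per window of several iterations.

```python
from math import ceil


def window_length(num_experts: int, experts_per_iter: int) -> int:
    """Iterations needed for every expert to be snapshotted once (assumed policy)."""
    if num_experts <= 0 or experts_per_iter <= 0:
        raise ValueError("num_experts and experts_per_iter must be positive")
    return ceil(num_experts / experts_per_iter)


def sparse_schedule(num_experts: int, experts_per_iter: int, iteration: int) -> list[int]:
    """Return the expert IDs snapshotted at `iteration`.

    Hypothetical round-robin policy: experts are visited in ID order,
    `experts_per_iter` at a time, wrapping around after each window.
    """
    window = window_length(num_experts, experts_per_iter)
    slot = iteration % window              # position inside the current window
    start = slot * experts_per_iter        # first expert ID in this slot
    stop = min(start + experts_per_iter, num_experts)
    return list(range(start, stop))


if __name__ == "__main__":
    # 64 experts, 16 snapshotted per iteration: full coverage every 4 iterations.
    for it in range(4):
        print(it, sparse_schedule(64, 16, it))
```

A real system would likely choose the subset size from the available checkpointing bandwidth rather than fix it by hand; the abstract does not say how MoEtion picks it, so the value above is purely illustrative.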

Paper Structure

This paper contains 36 sections, 13 equations, 16 figures, 7 tables, 1 algorithm.

Figures (16)

  • Figure 1: Performance of Gemini [gemini] during training of the DeepSeek-16.4B/64-Experts MoE model [deepseek-moe] using 96 A100 GPUs.
  • Figure 2: Checkpoint-based fault tolerance in distributed training. A checkpoint is taken every $Ckpt_{\text{interval}} = 10$ iterations. On failure, training rolls back to the most recent complete checkpoint ($\text{CKPT}_{20}$) and recomputes lost training progress, and then resumes.
  • Figure 3: The system architecture of MoEtion.
  • Figure 4: MoE training dynamics in DeepSeek-16.4B/64E deepseek-moe. (a) Token distribution (color-coded by expert) is dynamic and skewed. (b) CDF of activated experts shows that nearly all experts are active in most iterations, each receiving non-zero tokens with uneven shares.
  • Figure 5: Dense vs. Sparse checkpointing
  • ...and 11 more figures
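
The rollback behavior described in the Figure 2 caption can be sketched as simple arithmetic. This is a minimal illustration, assuming that a checkpoint is complete as soon as its iteration finishes and that checkpoints fall on exact multiples of the interval; the function name `rollback_point` and the failure iteration used in the example (27) are hypothetical and do not come from the paper.

```python
def rollback_point(failed_iteration: int, ckpt_interval: int) -> tuple[int, int]:
    """Return (last checkpointed iteration, iterations to recompute).

    Assumes checkpoints are taken at multiples of `ckpt_interval` and are
    complete immediately, which simplifies what a real system must handle.
    """
    if ckpt_interval <= 0:
        raise ValueError("ckpt_interval must be positive")
    if failed_iteration < 0:
        raise ValueError("failed_iteration must be non-negative")
    last_ckpt = (failed_iteration // ckpt_interval) * ckpt_interval
    lost = failed_iteration - last_ckpt    # progress that must be recomputed
    return last_ckpt, lost


if __name__ == "__main__":
    # With an interval of 10, a failure at iteration 27 rolls back to 20.
    print(rollback_point(27, 10))
```

Under these assumptions, a shorter interval reduces the recomputed work but increases how often checkpointing cost is paid, which is the trade-off the caption's scenario illustrates.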