Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models

Yongji Wu; Wenjie Qu; Xueshen Liu; Tianyang Tao; Yifan Qiao; Zhuang Wang; Wei Bai; Yuan Tian; Jiaheng Zhang; Z. Morley Mao; Matthew Lentz; Danyang Zhuo; Ion Stoica

TL;DR

Lazarus tackles the fragility of training sparsely activated Mixture-of-Experts (MoE) models in volatile cloud environments by introducing adaptive expert replica allocation and a provably optimal Maximum Rank Overlap (MRO) placement that maximizes recovery probability under node failures. It couples this with a flexible, CUDA-based token dispatcher and efficient greedy reconfiguration so that all remaining GPUs are used after failures, avoiding checkpoint-restart penalties. The system is implemented in PyTorch and evaluated across multiple MoE scales, showing up to 5.7x throughput gains under frequent failures and 3.4x on real spot traces compared with checkpoint-based baselines, and outperforming Tutel-based variants in large clusters. Overall, Lazarus enables cost-effective, fault-tolerant MoE training on unreliable hardware, with practical implications for scaling LLMs in public clouds.

Abstract

The sparsely activated Mixture-of-Experts (MoE) architecture has increasingly been adopted to further scale large language models (LLMs). However, frequent failures still pose significant challenges as training scales. The cost of even a single failure is significant: all GPUs must sit idle until the failure is resolved, and considerable training progress can be lost because training has to restart from checkpoints. This problem is exacerbated by the growing use of spot instances on public clouds for model training, which, despite offering substantial cost savings, introduce frequent preemptions, essentially failures that occur regularly throughout the training process. Existing solutions for efficient fault-tolerant training either lack elasticity or rely on building resiliency into pipeline parallelism, which cannot be applied to MoE models due to the expert parallelism strategy adopted by the MoE architecture. We present Lazarus, a system for resilient and elastic training of MoE models. Lazarus adaptively allocates expert replicas to address the inherent imbalance in expert workload and speed up training, while a provably optimal expert placement algorithm maximizes the probability of recovery upon failures. Through adaptive expert placement and a flexible token dispatcher, Lazarus can also fully utilize all available nodes after failures, leaving no GPU idle. Our evaluation shows that Lazarus outperforms existing MoE training systems by up to 5.7x under frequent node failures and 3.4x on a real spot instance trace.
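
To make the idea of load-aware replica allocation concrete, here is a minimal sketch. It is not the allocation algorithm from the paper: it assumes, purely for illustration, that every expert gets at least one replica and that the remaining slots are split in proportion to observed token counts using largest-remainder rounding. The function name `allocate_replicas` and its parameters are hypothetical.

```python
def allocate_replicas(loads, total_slots):
    """Split `total_slots` replica slots across experts by observed load.

    Illustrative assumption (not taken from the paper): each expert keeps
    at least one replica, and spare slots are shared proportionally to
    `loads` (integer token counts) with largest-remainder rounding.
    """
    num_experts = len(loads)
    if total_slots < num_experts:
        raise ValueError("need at least one slot per expert")

    spare = total_slots - num_experts
    # With no load information, fall back to treating all experts equally.
    weights = loads if sum(loads) > 0 else [1] * num_experts
    total = sum(weights)

    # Integer arithmetic avoids floating-point ties in the remainders.
    floors = [spare * w // total for w in weights]
    remainders = [spare * w % total for w in weights]
    counts = [1 + f for f in floors]

    # Hand out the slots lost to rounding, largest remainder first.
    leftover = spare - sum(floors)
    order = sorted(range(num_experts), key=lambda e: remainders[e], reverse=True)
    for e in order[:leftover]:
        counts[e] += 1
    return counts


# Hypothetical example: 4 experts with skewed load, 8 replica slots in total.
example_counts = allocate_replicas([50, 30, 10, 10], 8)
```

In this hypothetical example the most heavily loaded expert receives the most replicas, and every slot is assigned, which mirrors the stated goal of leaving no GPU idle.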

Paper Structure (25 sections, 2 theorems, 3 equations, 14 figures, 2 tables, 1 algorithm)


Introduction
Background and Motivation
MoE Models and Expert Parallelism
Fault-Tolerant and Elastic Training
System Overview
Design
Adaptive Expert Allocation and Placement
Flexible Token Dispatcher
Efficient Reconfiguration
Implementation
Evaluation
Setups
Controlled Single Node Failures
Controlled Multi Node Failures
Spot Instance Trace
...and 10 more sections

Key Result

Theorem 1

For any MRO plan $T$ and any $R$, given the number of replicas $r_e$ for each expert $e$, $T$ maximizes the recovery probability $\mathsf{Pr}(\bigcup_{a \in A} Col_a = [E])$, where $[E]$ is the set of experts, $Col_a$ is the set of expert replicas assigned to node $a$, and $A$ is a uniformly sampled set of $R$ nodes.
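
The recovery probability in the theorem can be checked by brute force on small instances. The sketch below enumerates every set $A$ of $R$ nodes and counts how often the union of their replicas covers all experts. The two placements are hypothetical toy examples (4 nodes, 2 replica slots each, 4 experts with 2 replicas each), not plans taken from the paper, and `recovery_probability` is an illustrative helper name.

```python
from fractions import Fraction
from itertools import combinations


def recovery_probability(placement, num_experts, num_alive):
    """Exact Pr(union of Col_a over a in A == [E]) for uniform A with |A| = num_alive.

    `placement[a]` is the set of experts that have a replica on node `a`.
    """
    all_experts = set(range(num_experts))
    recovered = 0
    total = 0
    for alive in combinations(range(len(placement)), num_alive):
        total += 1
        covered = set().union(*(placement[a] for a in alive))
        if covered == all_experts:
            recovered += 1
    return Fraction(recovered, total)


# Hypothetical plan A: nodes hold identical expert sets wherever possible.
plan_a = [{0, 1}, {0, 1}, {2, 3}, {2, 3}]
# Hypothetical plan B: same replica counts, but replicas are spread in a ring.
plan_b = [{0, 1}, {1, 2}, {2, 3}, {3, 0}]

prob_a = recovery_probability(plan_a, num_experts=4, num_alive=2)  # Fraction(2, 3)
prob_b = recovery_probability(plan_b, num_experts=4, num_alive=2)  # Fraction(1, 3)
```

With two surviving nodes, the plan whose nodes overlap heavily recovers in 4 of the 6 cases, while the ring-style plan recovers in only 2. This agrees with the intuition behind maximizing overlap, although the toy plan is not claimed to be the output of the MRO algorithm.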

Figures (14)

Figure 1: MoE architecture utilizes expert parallelism for distributed training, yet it also suffers from imbalanced workload due to the dynamic nature of gate networks.
Figure 2: Expert loads on a 16-expert model (GPT-L; see the Setups section). The distribution varies during training and across layers.
Figure 3: System architecture of Lazarus.
Figure 4: Fault resiliency depends on how expert replicas are placed. With the same replica allocation of 4 experts and 4 replica slots per node, placement plans A and B differ in recovery probability under 3 node failures.
Figure 5: Lazarus minimizes the failure probability by minimizing the number of vertices representing node failures that have incident edges. Here we consider 3 node failures. Comparing Case I and II, when expert overlap on nodes is not maximized, there are more unique failure patterns. Comparing Case I and III, swapping any expert also leads to more failure patterns.
...and 9 more figures

Theorems & Definitions (2)

Theorem 1
Theorem 1
