ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments

Haley Li; Xinglu Wang; Cong Feng; Chunxu Zuo; Yanan Wang; Hei Lo; Yufei Cui; Bingji Wang; Duo Cui; Shuming Jing; Yizhou Shan; Ying Xiong; Jiannan Wang; Yong Zhang; Zhenan Fan

TL;DR

ReviveMoE is a method for rapid failure recovery in large-scale LLM deployments without restarting the serving instance. It is designed to support both the traditional LLM architecture, which collocates MoE and attention on the same hardware, and disaggregated architectures, which separate MoE from attention.

Abstract

As LLM deployments scale across more hardware, the probability of a failure somewhere in the system increases significantly, and cloud operators must consider robust countermeasures to handle these inevitable failures. A common recovery approach is to simply restart the LLM serving instance; however, this is costly in model-as-a-service (MaaS) inference settings, where reloading model weights and recompiling computation graphs can significantly delay incoming requests. We propose ReviveMoE, a method for rapid failure recovery in large-scale LLM deployments without restarting the serving instance. ReviveMoE is designed to support both the traditional LLM architecture, which collocates MoE and attention on the same hardware, and disaggregated architectures, which separate MoE from attention. ReviveMoE is integrated into Huawei Cloud's MaaS and is built on top of Huawei's xDeepServe serving platform and the XCCL communications library.

Paper Structure (22 sections, 6 figures, 2 tables)

Introduction
Background
Mixture of Experts Models
Disaggregating MoE and Attention
XCCL
Graph Execution
ReviveMoE Design
Failure Detection
Sequence State Recovery
Block Table Recovery
Weight Integrity
Recreating Communications
Graph Mode
Evaluation
Recovery Time
...and 7 more sections

Figures (6)

Figure 1: Breakdown of the time taken in seconds for a cached reinitialization of a DeepSeek V3 instance on 80 NPUs (total 83.1 s). "Compile" refers to a cached compile.
Figure 2: Overview of a FlowServe inference instance.
Figure 3: ReviveMoE design under an attention failure scenario in an MA-disaggregated deployment. NPU1 experiences a failure. The engine does not receive the heartbeat from DPExecutor 1, so it initiates recovery. Requests are migrated from DPExecutor 1 to other DPExecutors, and DPExecutor 1 is terminated. The communications domain is destroyed and reinitialized without NPU1. The graph cache is loaded from disk and a cached compilation builds the computation graph. The block table is restored on all DPExecutors and inference can begin again.
Figure 4: Flowchart for deciding action when a failure involves MoE weights.
Figure 5: Recovery times for various ReviveMoE scenarios. In the MA-disaggregated scenario, the brackets indicate the module that failed and the MoE recovery technique, if applicable. Table \ref{tab:labels} further explains each timing category.
...and 1 more figure
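
The recovery sequence described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not ReviveMoE's implementation: the names `Engine`, `Executor`, and `HEARTBEAT_TIMEOUT_S` are hypothetical, the round-robin migration policy is an assumption, and the communications, graph cache, and block table steps are placeholders recorded in a log rather than real calls into XCCL or the serving platform.

```python
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT_S = 5.0  # hypothetical threshold, not taken from the paper


@dataclass
class Executor:
    """Stand-in for a DPExecutor bound to one NPU (hypothetical type)."""
    rank: int
    last_heartbeat: float
    requests: list = field(default_factory=list)


@dataclass
class Engine:
    """Stand-in for the engine that monitors executor heartbeats."""
    executors: dict  # rank -> Executor
    log: list = field(default_factory=list)

    def find_failed(self, now):
        """Return ranks whose last heartbeat is older than the timeout."""
        return [rank for rank, ex in self.executors.items()
                if now - ex.last_heartbeat > HEARTBEAT_TIMEOUT_S]

    def recover(self, failed_rank):
        """Apply the steps in the order given by the Figure 3 caption."""
        failed = self.executors[failed_rank]
        survivors = sorted(r for r in self.executors if r != failed_rank)
        if not survivors:
            raise RuntimeError("no surviving executor to migrate requests to")

        # Migrate requests to the survivors (round-robin is an assumption).
        for i, request in enumerate(failed.requests):
            target = survivors[i % len(survivors)]
            self.executors[target].requests.append(request)
        self.log.append(f"migrated {len(failed.requests)} requests from rank {failed_rank}")

        # Terminate the failed executor.
        del self.executors[failed_rank]

        # Placeholders for the remaining steps; a real system would call
        # into its communications library and graph compiler here.
        self.log.append(f"reinitialized communications domain with ranks {survivors}")
        self.log.append("loaded graph cache from disk and ran cached compile")
        self.log.append("restored block tables on surviving executors")
        return survivors


# Usage: executor 1 stops sending heartbeats and is recovered.
engine = Engine({r: Executor(r, last_heartbeat=100.0) for r in range(3)})
engine.executors[1].last_heartbeat = 90.0
engine.executors[1].requests = ["req-a", "req-b", "req-c"]
for rank in engine.find_failed(now=100.0):
    engine.recover(rank)
```

The point of the sketch is the ordering: requests leave the failed executor before it is terminated, and the communications domain is rebuilt before the cached graph is loaded and the block tables are restored, so the serving instance itself never restarts.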
