AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism

Wendong Xu, Chujie Chen, He Xiao, Kuan Li, Jing Xiong, Chen Zhang, Wenyong Zhou, Chaofan Tao, Yang Bai, Bei Yu, Ngai Wong

TL;DR

AnchorTP tackles the fragility of TP-based LLM inference under GPU failures by decoupling long-lived state from dynamic topology and enabling fast, data-minimizing recovery. It introduces Elastic Tensor Parallelism (ETP), which allows unequal-width shard partitions at any TP degree, and a state-preserving daemon that keeps parameters and KV caches resident in GPU memory. A two-stage recovery pipeline, with Continuous Minimal Migration (CMM) for planning and a topology-aware executor for execution, minimizes reloads and overlaps host reloads with P2P transfers, so service resumes quickly without changes to service interfaces. Experiments show up to an $11\times$ reduction in Time to First Success (TFS) and up to a $59\%$ reduction in Time to Peak (TTP) compared with restart-and-reload.

Abstract

Large Language Model (LLM) inference services demand exceptionally high availability and low latency, yet multi-GPU Tensor Parallelism (TP) makes them vulnerable to single-GPU failures. We present AnchorTP, a state-preserving elastic TP framework for fast recovery. It (i) enables Elastic Tensor Parallelism (ETP) with unequal-width partitioning over any number of GPUs and compatibility with Mixture-of-Experts (MoE), and (ii) preserves model parameters and KV caches in GPU memory via a daemon decoupled from the inference process. To minimize downtime, we propose a bandwidth-aware planner based on a Continuous Minimal Migration (CMM) algorithm that minimizes reload bytes under a byte-cost dominance assumption, and an execution scheduler that pipelines P2P transfers with reloads. These components jointly restore service quickly with minimal data movement and without changing service interfaces. In typical failure scenarios, AnchorTP reduces Time to First Success (TFS) by up to 11x and Time to Peak (TTP) by up to 59% versus restart-and-reload.

Paper Structure

This paper contains 14 sections, 1 equation, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Recovery strategies when one GPU fails in a four-GPU deployment. (a) Without elastic TP, service cannot resume. (b) With elastic TP but no state preservation, service restarts with three GPUs but fully reloads parameters from host. (c) With state-preserving elastic TP, parameters/KVs on surviving GPUs are reused via planned P2P transfers with minimal reload. (d) Time breakdown for a typical restart-and-reload on Qwen3-14B (Qwen3 technical report); host-to-GPU reload dominates.
  • Figure 2: AnchorTP overview with two planes. The state plane runs daemons that pin GPU memory for model parameters and the KV cache. The control plane monitors failures, plans recovery via our Continuous Minimal Migration (CMM) algorithm, and the executor coordinates data migration and system reinitialization.
  • Figure 3: Example ($4 \rightarrow 3$ GPUs). 1024 rows (modeled as a 1D byte interval) are split across 4 GPUs. After GPU:2 fails, the target plan is $[0,341)$, $[341,682)$, $[682,1024)$. GPU:1 keeps $[0,256)$ and reloads $[256,341)$; GPU:3 reloads $[341,512)$ and keeps $[512,682)$; GPU:4 receives $[682,768)$ via P2P from GPU:3 and keeps $[768,1024)$. Only 256 rows are reloaded; the rest are kept in place or moved via P2P.
  • Figure 4: Example of EPLB rebalancing after a failure. (a) Initially, 4 GPUs are perfectly balanced. (b) After GPU 1 fails, AnchorTP re-shards the parameters, leaving GPU 3 with a smaller shard and thus more free compute resources. (c) EPLB, aware of this, intelligently places the recovered Expert 1 and new replicas of the hotspot Expert 0 onto the most idle GPU (GPU 3), achieving a new, performance-optimal state that is not arithmetically balanced but maximizes system throughput.
  • Figure 5: Per-switch TFS and TTP as TP decreases ($k \rightarrow k-1$) for Qwen3-30B-A3B and Mixtral-8$\times$22B. Lower is better.
  • ...and 1 more figures
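
The interval arithmetic in the Figure 3 caption can be reproduced with a short sketch. This is not the paper's CMM algorithm, only an illustration of the keep / P2P / reload classification under assumptions made here: rows form a 1D half-open interval, survivors take the new shards in their original order, and shard boundaries are floor-divided (which matches the caption's $[0,341)$, $[341,682)$, $[682,1024)$). The names `split_rows`, `plan_recovery`, and `rows_by_action` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end)
Step = Tuple[int, str, Interval, Optional[int]]  # (dst GPU, action, rows, src GPU)


def split_rows(total: int, parts: int) -> List[Interval]:
    """Split [0, total) into `parts` contiguous shards (floor-divided bounds)."""
    bounds = [i * total // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def plan_recovery(total_rows: int, gpus: List[int], failed: int) -> List[Step]:
    """Classify every row range of the new layout as keep, p2p, or reload.

    Assumption (illustrative only): survivors receive new shards in the
    same order in which they held the old ones.
    """
    old = dict(zip(gpus, split_rows(total_rows, len(gpus))))
    survivors = [g for g in gpus if g != failed]
    new = dict(zip(survivors, split_rows(total_rows, len(survivors))))

    steps: List[Step] = []
    for dst, (a, b) in new.items():
        for owner, (s, e) in old.items():
            lo, hi = max(a, s), min(b, e)  # overlap of new shard and old shard
            if lo >= hi:
                continue
            if owner == dst:
                steps.append((dst, "keep", (lo, hi), dst))
            elif owner != failed:
                steps.append((dst, "p2p", (lo, hi), owner))
            else:
                # Rows lived only on the failed GPU: reload from host.
                steps.append((dst, "reload", (lo, hi), None))
    return steps


def rows_by_action(steps: List[Step]) -> Dict[str, int]:
    """Total number of rows handled by each action."""
    totals = {"keep": 0, "p2p": 0, "reload": 0}
    for _, action, (lo, hi), _ in steps:
        totals[action] += hi - lo
    return totals


if __name__ == "__main__":
    steps = plan_recovery(1024, gpus=[1, 2, 3, 4], failed=2)
    for step in steps:
        print(step)
    print(rows_by_action(steps))
```

With the caption's inputs (1024 rows, GPUs 1 to 4, GPU 2 failed), the reload total equals the 256 rows that were resident only on the failed GPU, which is the lower bound for any plan that does not replicate shards.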