FailSafe: High-performance Resilient Serving

Ziyi Xu, Zhiqiang Xie, Swapnil Gandhi, Christos Kozyrakis

TL;DR

FailSafe tackles the fragility of tensor-parallel LLM inference under irregular GPU availability by introducing Cyclic KVCache Placement, Hybrid Attention, and Fine-Grained Load-Aware Routing to balance memory and compute. It also uses Proactive KVCache Backup and On-Demand Weight Recovery to cut recovery latency. In experiments on an 8xH100 DGX, FailSafe achieves up to 2x higher throughput and up to 183x faster recovery, and it sustains high throughput even with up to three GPU failures. The work targets practical resilience for large multi-GPU deployments and suggests applicability to future architectures and larger heterogeneous systems such as NVL72.

Abstract

Tensor parallelism (TP) enables large language models (LLMs) to scale inference efficiently across multiple GPUs, but its tight coupling makes systems fragile: a single GPU failure can halt execution, trigger costly KVCache recomputation, and introduce long-term compute and memory imbalance. We present FailSafe, a fault-tolerant TP serving system that sustains high performance under irregular GPU availability. FailSafe introduces three techniques to balance computation and memory across GPUs: (1) Cyclic KVCache Placement for uniform memory utilization, (2) Hybrid Attention combining tensor- and data-parallel attention to eliminate stragglers, and (3) Fine-Grained Load-Aware Routing to dynamically balance requests. It further employs proactive KVCache backup and on-demand weight recovery to avoid expensive recomputation and redundant data transfers. We implement these techniques in a lightweight serving engine compatible with existing LLM infrastructures. Evaluated on an 8xH100 DGX system with real-world fault traces and representative workloads, FailSafe achieves up to 2x higher throughput and two orders of magnitude lower recovery latency compared to standard fault handling approaches. Even with up to three GPU failures, FailSafe sustains high throughput and balanced utilization, demonstrating robust and efficient LLM serving under dynamic and unreliable hardware conditions.
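
The load-aware routing idea mentioned in the abstract can be illustrated with a minimal sketch. It is not FailSafe's router: the function names, the use of pending token counts as the load metric, and the three-GPU layout are assumptions made for illustration only. It contrasts a round-robin choice with a least-loaded choice.

```python
def route_round_robin(next_index, num_gpus):
    """Baseline: pick GPUs in a fixed cycle, ignoring their current load."""
    return next_index % num_gpus


def route_least_loaded(pending_tokens, new_request_tokens):
    """Send the new request to the GPU with the fewest pending tokens.

    pending_tokens is a hypothetical per-GPU load metric (tokens still
    waiting to be processed). The list is updated in place.
    """
    gpu = min(range(len(pending_tokens)), key=pending_tokens.__getitem__)
    pending_tokens[gpu] += new_request_tokens
    return gpu


if __name__ == "__main__":
    # Assumed state: one GPU holds a 4-token request, two hold 1-token requests.
    loads = [4, 1, 1]
    print("round-robin picks GPU", route_round_robin(3, len(loads)))
    print("least-loaded picks GPU", route_least_loaded(loads, 1))
    print("loads after routing:", loads)
```

In this toy state, the round-robin choice lands on the GPU that already holds the longest request, while the least-loaded choice keeps the per-GPU token counts closer together.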

Paper Structure

This paper contains 22 sections, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Illustration of the proposed cyclic placement for balancing KVCache memory usage across GPUs. In this example, the model has 4 key-value heads and deploys non-uniform TP3. KV$_{i}$ stands for the KVCache of the $i$-th key-value head. Cyclic placement (bottom) improves overall KVCache capacity by approximately 50% compared with a naïve placement (top).
  • Figure 2: Illustration of the proposed hybrid attention. In this example, the model has 4 key-value heads and deploys non-uniform TP3. A$_i$ stands for the $i$-th head's computation for request $A$. Request $A$ is routed to GPU$_0$, $B$ to GPU$_1$, and $C$ to GPU$_2$. Hybrid attention (bottom) combines TP attention with DP attention, significantly reducing GPU idle time and improving GPU utilization.
  • Figure 3: Illustration of the load-aware router and scheduler. In this example, request 0 has 4 tokens, requests 1 and 2 have 1 token each, and a new request 3 with 1 token arrives. In the naïve setting (top), a round-robin router combined with a FIFO chunked prefill scheduler results in a highly unbalanced prefill batch: within the prefill token budget (which is 3), only a chunk of request 0 is scheduled in the prefill batch. In contrast, our load-aware router (bottom) dynamically redirects new requests to the least-loaded GPU, and our adaptive chunked prefill mechanism helps form a balanced batch.
  • Figure 4: On-demand recovery mechanism. In this example, the FFN weights are divided into 12 shards and there are 4 attention heads with corresponding KVCache. The system starts with normal TP4. When GPU 3 fails, we restore all the lost state (weights and KVCache) via PCIe. Our on-demand recovery mechanism eliminates all redundant PCIe transfers.
  • Figure 5: GPU availability from GCP cloud availability traces.
  • ...and 6 more figures
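
The roughly 50% capacity figure in the Figure 1 caption can be checked with a small arithmetic sketch. It assumes one plausible reading of cyclic placement, namely that the head-to-GPU mapping is rotated by one GPU per layer. That rotation rule, the function names, and the three-layer toy model are assumptions, not details taken from the paper.

```python
from fractions import Fraction


def naive_placement(num_heads, num_gpus, num_layers):
    """Same mapping in every layer: head h lives on GPU h % num_gpus."""
    return [[h % num_gpus for h in range(num_heads)]
            for _ in range(num_layers)]


def cyclic_placement(num_heads, num_gpus, num_layers):
    """Assumed rule: shift the mapping by one GPU for each successive layer."""
    return [[(h + layer) % num_gpus for h in range(num_heads)]
            for layer in range(num_layers)]


def max_heads_per_gpu(placement, num_gpus):
    """Count KV heads stored per GPU over all layers; return the largest count."""
    load = [0] * num_gpus
    for layer in placement:
        for gpu in layer:
            load[gpu] += 1
    return max(load)


def capacity_gain(num_heads, num_gpus, num_layers):
    """Token capacity is limited by the most loaded GPU, so the gain is the
    ratio of the naive maximum load to the cyclic maximum load."""
    naive = max_heads_per_gpu(
        naive_placement(num_heads, num_gpus, num_layers), num_gpus)
    cyclic = max_heads_per_gpu(
        cyclic_placement(num_heads, num_gpus, num_layers), num_gpus)
    return Fraction(naive, cyclic)


if __name__ == "__main__":
    # 4 key-value heads on 3 GPUs, as in the Figure 1 example; 3 layers is a toy choice.
    print(capacity_gain(num_heads=4, num_gpus=3, num_layers=3))
```

With 4 heads on 3 GPUs, the naïve mapping stores two heads per layer on one GPU, while the rotated mapping averages 4/3 heads per layer on each GPU, giving a ratio of 3/2. When the layer count is not a multiple of the GPU count, the rotation does not balance exactly, which is consistent with the caption's "approximately".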