FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference

Bingzhe Zhao, Ke Cheng, Aomufei Yuan, Yuxuan Tian, Ruiguang Zhong, Chengchen Hu, Tong Yang, Lian Yu

TL;DR

FairKV addresses the load imbalance that arises when imbalanced per-head KV cache compression is used in multi-GPU Transformer inference. Guided by KV-cache statistics, it introduces two techniques: Best-effort Assignment, which rearranges attention heads across GPUs to balance per-head workloads, and Fair-Copying, which replicates heavy heads via data parallelism. The optimizer uses a recursive backtracking search to find near-optimal allocations. Experiments show throughput improvements of up to 1.66x on models such as LLaMA-3.3-70B-Instruct and Mistral-24B, with robust gains across tensor-parallel configurations and KV budgets. The approach targets practical, scalable improvements for real-time multi-GPU inference, and the authors state that the code will be released as open source upon acceptance.
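To make the backtracking idea concrete, here is a minimal sketch of a recursive search that assigns heads to GPUs so that the heaviest GPU is as light as possible. It is not the paper's optimizer: the function name `backtrack_assign`, the use of retained KV entries per head as the load measure, and the pruning rules are all assumptions made for illustration.

```python
from typing import List, Tuple


def backtrack_assign(kv_sizes: List[int], num_gpus: int) -> Tuple[int, List[int]]:
    """Return (max GPU load, GPU index per head) minimizing the max GPU load.

    kv_sizes[h] is a hypothetical load for head h, e.g. its retained KV entries.
    """
    # Placing heavy heads first makes the pruning below far more effective.
    order = sorted(range(len(kv_sizes)), key=lambda h: -kv_sizes[h])
    best_max = sum(kv_sizes) + 1          # worse than any real assignment
    best_assign: List[int] = []
    gpu_load = [0] * num_gpus
    assign = [-1] * len(kv_sizes)

    def search(i: int) -> None:
        nonlocal best_max, best_assign
        if i == len(order):
            # Every branch that reaches a leaf is strictly better (see pruning).
            best_max = max(gpu_load)
            best_assign = assign.copy()
            return
        head = order[i]
        tried = set()
        for g in range(num_gpus):
            # GPUs with equal load are interchangeable; try only one of them.
            if gpu_load[g] in tried:
                continue
            tried.add(gpu_load[g])
            # Prune: this placement cannot beat the best solution found so far.
            if gpu_load[g] + kv_sizes[head] >= best_max:
                continue
            gpu_load[g] += kv_sizes[head]
            assign[head] = g
            search(i + 1)
            gpu_load[g] -= kv_sizes[head]
            assign[head] = -1

    search(0)
    return best_max, best_assign


# Illustrative per-head loads (made-up numbers, not measurements).
kv_sizes = [9, 7, 6, 5, 4, 3, 2, 2]
# Static split: first half of the heads on GPU 0, second half on GPU 1.
static_max = max(sum(kv_sizes[:4]), sum(kv_sizes[4:]))
best_max, assignment = backtrack_assign(kv_sizes, num_gpus=2)
print(static_max, best_max)
```

In this toy case the static split leaves one GPU with 27 units and the other with 11, while the search finds a 19/19 split. The search is exponential in the worst case, so a sketch like this is only reasonable for the small per-layer head counts it is run on.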

Abstract

KV cache techniques in Transformer models reduce redundant computation at the expense of substantially increased memory usage, making KV cache compression an important and popular research topic. Recent state-of-the-art KV cache compression methods use imbalanced, per-head allocation algorithms that dynamically adjust the KV cache budget for each attention head, achieving excellent performance in single-GPU scenarios. However, we observe that such imbalanced compression leads to significant load imbalance in multi-GPU inference, as some GPUs become overburdened while others remain underutilized. In this paper, we propose FairKV, a method designed to ensure fair memory usage among attention heads in systems employing imbalanced KV cache compression. The core technique of FairKV is Fair-Copying, which replicates a small subset of memory-intensive attention heads across GPUs using data parallelism to mitigate load imbalance. Our experiments on popular models, including LLaMA 70B and Mistral 24B, demonstrate that FairKV increases throughput by up to 1.66x compared to standard tensor-parallel inference. Our code will be released as open source upon acceptance.
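The effect of Fair-Copying can be sketched with a toy load model. The assumption here, which is ours and not stated in the abstract, is that a head replicated on r GPUs has its batch split evenly among the replicas, so each replica carries 1/r of that head's load. The function name `per_gpu_load` and the numbers are hypothetical.

```python
from typing import Dict, List


def per_gpu_load(head_loads: List[float],
                 placement: Dict[int, List[int]],
                 num_gpus: int) -> List[float]:
    """Sum the load on each GPU given which GPUs host each head.

    placement[h] lists the GPUs holding a copy of head h. Assumption: replicas
    split the batch evenly, so each copy carries head_loads[h] / len(placement[h]).
    """
    totals = [0.0] * num_gpus
    for head, gpus in placement.items():
        share = head_loads[head] / len(gpus)
        for g in gpus:
            totals[g] += share
    return totals


# One memory-intensive head and three light ones (made-up numbers).
head_loads = [16, 2, 2, 2]

# Without copying, the heavy head pins GPU 0 no matter where the others go.
no_copy = per_gpu_load(head_loads, {0: [0], 1: [1], 2: [1], 3: [1]}, num_gpus=2)

# With the heavy head copied to both GPUs, its load is shared.
with_copy = per_gpu_load(head_loads, {0: [0, 1], 1: [0], 2: [0], 3: [1]}, num_gpus=2)

print(no_copy, with_copy)
```

Under this model the maximum per-GPU load drops from 16 to 12 once the heavy head is copied. The sketch ignores the extra weight memory that each replica consumes, which is one reason to replicate only a small subset of heads.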

Paper Structure

This paper contains 28 sections, 5 equations, 6 figures, 3 tables, 2 algorithms.

Figures (6)

  • Figure 1: Impact of Batch Size and KV Cache Budget on Inference Latency
  • Figure 2: Illustration of different head allocation strategies for multi-GPU inference in large-scale transformer models. The figure shows three strategies: (1) Static Head Allocation (SHA), where attention heads are evenly distributed across GPUs without considering computational load; (2) Load-Aware Head Allocation (FairKV-NoDP), where attention heads are allocated based on their computational load to balance the GPU busy rate; and (3) Load-Aware Head Allocation with Data Parallelism (FairKV-DP), where heads are replicated across GPUs for improved load distribution and efficiency.
  • Figure 3: Throughput Gain Rates of FairKV on different models, where the throughput of the baseline model is regarded as 1.0.
  • Figure 4: Ablation Test Among the Standard Model, FairKV without Fair-Copying, and FairKV with Fair-Copying
  • Figure 5: GPU Busy Rate with Different Data-Parallel Sizes on LLaMA-3.3-70B.
  • ...and 1 more figure