Swarm: Co-Activation Aware KVCache Offloading Across Multiple SSDs

Tuowei Wang; Liyun Chu; Ruwen Fan; Ju Ren

Abstract

The key-value (KV) cache has become the dominant contributor to memory consumption in large language model (LLM) inference. Although offloading KVCache from GPU high-bandwidth memory (HBM) to CPU DRAM alleviates device memory pressure, DRAM remains capacity-limited and costly for large, persistent workloads. Solid-state drives (SSDs) provide a cost-effective alternative, but naive SSD-based paging is fundamentally bandwidth-bound due to limited PCIe throughput and per-device bandwidth constraints. In this paper, we observe that KVCache activations in real-world workloads exhibit strong and stable correlations. We term this phenomenon KVCache Co-Activation, where accessing a KV entry is often accompanied by a stable and recurring set of other KV entries. Leveraging this property, we present Swarm, an SSD-based KVCache offloading system that converts bandwidth-bound single-device access into parallel I/O across multiple SSDs. Specifically, Swarm clusters co-activated KV entries offline and distributes the resulting clusters across SSDs using graph-based placement with selective replication to maximize parallel I/O bandwidth. At runtime, Swarm performs load-balanced cluster retrieval and dynamically adapts clustering and caching decisions to sustain high bandwidth utilization under evolving access patterns. Evaluations show that Swarm reduces I/O time by 2.41x and improves effective bandwidth utilization by 2.72x.
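The co-activation idea can be illustrated with a small sketch. This is not Swarm's algorithm: the abstract only says that clusters are built offline and placed with graph-based placement and selective replication. The code below assumes a simple pair-count threshold for clustering and plain round-robin striping for placement, with no replication. All function names, the trace format (one list of activated entry IDs per decoding step), and the threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coactivation_counts(traces):
    """Count how often each pair of KV entries is activated in the same step."""
    counts = Counter()
    for active in traces:
        for i, j in combinations(sorted(set(active)), 2):
            counts[(i, j)] += 1
    return counts


def threshold_clusters(counts, threshold):
    """Group entries linked by pair counts >= threshold (union-find).

    Assumption: a stand-in for the paper's correlation-aware clustering.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), c in counts.items():
        root_i, root_j = find(i), find(j)
        if c >= threshold:
            parent[root_i] = root_j

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in clusters.values())


def stripe_across_ssds(cluster, num_ssds):
    """Round-robin placement so one cluster's reads are spread over all SSDs.

    Assumption: a stand-in for graph-based placement; no replication.
    """
    placement = {ssd: [] for ssd in range(num_ssds)}
    for k, entry in enumerate(cluster):
        placement[k % num_ssds].append(entry)
    return placement


# Hypothetical activation traces: entry IDs touched at each decoding step.
traces = [[0, 1, 2], [0, 1, 2], [3, 4], [0, 1, 2, 3], [3, 4]]
counts = coactivation_counts(traces)
clusters = threshold_clusters(counts, threshold=2)
placement = stripe_across_ssds(clusters[0], num_ssds=2)
```

Because the members of a cluster tend to be requested together, spreading them over several devices lets a single retrieval issue reads to all of those devices at once rather than queueing on one.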

Paper Structure (23 sections, 9 equations, 20 figures, 5 tables, 1 algorithm)

Introduction
Background
KVCache in Autoregressive Generation
Three-Tier KVCache Offloading
Sparsity-Driven KVCache Activation
Motivation and Challenges
Insight: KVCache Co-Activation
Main Challenges
Design Overview
Offline Phase: Modeling and Placement
Correlation-Aware Clustering
Offloading-Friendly Partition
Online Phase: Retrieval and Update
Load-Balanced Scheduling
Cluster-Aligned Adaptation
...and 8 more sections

Figures (20)

Figure 1: Memory hierarchy for KVCache offloading. Swarm exploits KVCache co-activation to aggregate multi-SSD bandwidth for better performance–scalability trade-offs.
Figure 2: TTFT breakdown for Qwen3-4B with no KVCache, SSD-based KVCache, and DRAM-based KVCache.
Figure 3: KVCache activation sparsity. (a) Perplexity on WikiText (2K context) under varying KV activation ratios. (b) Dynamic activation patterns of KVCache across decoding steps.
Figure 4: Visualization of KVCache co-activation patterns across models and datasets. Each matrix element $(i, j)$ denotes how frequently KVCache entries $e_i$ and $e_j$ are activated together.
Figure 5: (a) KVCache co-activation distribution on Qwen3-32B over 128 queries. (b) Retrieval paths' impact on effective bandwidth across layers. (c) KVCache co-activation probability degradation across decoding steps under naive updates.
...and 15 more figures
