CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design

Jiawei Yi, Ping Gong, Youhui Bai, Jiaqi Ruan, Shengnan Wang, Pengcheng Wang, Haibo Wang, Weiguang Wang, Xia Zhu, Feng Wu, Cheng Li

TL;DR

CLO addresses the KVCache memory and data-transfer bottlenecks in long-context LLM inference by combining a CPU-light KVCache offloading algorithm with system-level optimizations. The key innovations are a head-wise approximate on-GPU cache driven by query similarity, head-importance-aware adaptive thresholds, and selective persistent caching, complemented by a zero-copy PCIe transfer engine and GPU-centric synchronization. Empirical results show that CLO achieves accuracy comparable to state-of-the-art systems while improving decoding throughput by 9.3%–66.6%, reaching near-peak PCIe bandwidth, and incurring negligible cache-management overhead. The work demonstrates that careful algorithm-system co-design is essential to fully exploit modern GPU platforms for memory-constrained LLM inference. CLO is open-sourced for community use.

Abstract

The growth of million-token LLMs exposes the scalability limits of inference systems, where the KVCache dominates memory usage and data transfer overhead. Recent offloading systems migrate the KVCache to CPU memory and incorporate top-k attention to reduce the volume of data transferred from the CPU, while further applying system-level optimizations such as on-GPU caching and prefetching to lower transfer overhead. However, they overlook the CPU bottleneck in three aspects: (1) substantial overhead of fine-grained dynamic cache management performed on the CPU side, (2) significant transfer overhead from poor PCIe bandwidth utilization caused by heavy gathering operations on the CPU side, and (3) GPU runtime bubbles introduced by coarse-grained CPU-centric synchronization. To address these challenges, we propose CLO, a CPU-light KVCache offloading system built via algorithm-system co-design. CLO features: (1) a coarse-grained head-wise approximate on-GPU caching strategy with negligible cache management cost, (2) a seamless combination of data prefetching and on-GPU persistent caching for lower transfer overhead, and (3) a zero-copy transfer engine to fully exploit PCIe bandwidth, together with a GPU-centric synchronization method to eliminate GPU stalls. Evaluation on two widely used LLMs demonstrates that CLO achieves accuracy comparable to state-of-the-art systems while substantially reducing CPU overhead and fully utilizing PCIe bandwidth, thereby improving decoding throughput by 9.3%-66.6%. Our results highlight that algorithm-system co-design is essential for memory-constrained LLM inference on modern GPU platforms. We open source CLO at https://github.com/CommediaJW/CLO.
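
To give a sense of scale for the claim that the KVCache dominates memory usage, the following back-of-the-envelope sketch estimates the KVCache footprint at a 128K-token context. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumption taken from the publicly documented Llama-3-8B configuration, and 2-byte (FP16/BF16) elements are also assumed; neither number comes from the abstract. The helper name `kvcache_bytes` is hypothetical.

```python
GIB = 1024 ** 3


def kvcache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                  batch_size=1, bytes_per_elem=2):
    """Size of the full KVCache in bytes.

    The leading factor 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed Llama-3-8B shape (grouped-query attention with 8 KV heads).
ASSUMED_MODEL = dict(num_layers=32, num_kv_heads=8, head_dim=128)

total_bytes = kvcache_bytes(seq_len=128 * 1024, **ASSUMED_MODEL)

# With a 10% top-k ratio, at most this much KV data is selected per
# decoding step (an upper bound on PCIe traffic if nothing is cached on GPU).
topk_bytes = total_bytes // 10

print(f"full KVCache:    {total_bytes / GIB:.1f} GiB")   # 16.0 GiB
print(f"10% top-k slice: {topk_bytes / GIB:.1f} GiB")    # 1.6 GiB
```

Under these assumptions a single 128K-token sequence already needs about 16 GiB for its KVCache, roughly the size of the 8B model's own FP16 weights, which is why offloading to CPU memory and reducing per-step transfer volume both matter.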

Paper Structure

This paper contains 29 sections, 10 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Overview of RetroInfer's [chen2025retroinfer] caching policy. PQCache [zhang2025pqcache] follows a similar workflow but performs cache lookup on the GPU.
  • Figure 2: Per-layer decoding latency breakdown of existing works and the ideal case (full KVCache resides in GPU HBM). Sequence length=128K, batch size=1, and top-$k$ ratio=10%. Evaluated on the Llama3-8B-1048K model [gradientlongcontextllama3].
  • Figure 3: Achieved PCIe bandwidth of the host data gathering and transfer process in InfiniGen and PQCache. Batch size=1 and top-$k$ ratio=10%, evaluated on the Llama3-8B-1048K model [gradientlongcontextllama3], with 64 CPU threads for data gathering.
  • Figure 4: The cosine similarity distribution among query vectors. Evaluated on the decoding phase of a randomly sampled sequence. "L02 H07" means layer 2, head 7. Adjacent queries (near the diagonal) exhibit high cosine similarity.
  • Figure 5: The average cosine similarity of queries from adjacent decoding steps. Evaluated on 30 randomly sampled sequences. High directional similarity is observed across models, layers, and heads.
  • ...and 9 more figures
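
The measurement behind Figures 4 and 5, the cosine similarity between query vectors of adjacent decoding steps, can be sketched in a few lines. This is only an illustration of the metric and of how a similarity threshold could gate reuse of a per-head on-GPU cache; it is not CLO's implementation. The function names and the threshold value are hypothetical, and the toy vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def adjacent_similarities(queries):
    """Similarity of each query with its predecessor for ONE attention head.

    `queries` is a list of per-step query vectors from the decoding phase.
    """
    return [cosine_similarity(prev, curr)
            for prev, curr in zip(queries, queries[1:])]


def should_reuse_cache(prev_query, curr_query, threshold):
    """Hypothetical head-wise decision: keep the cached KV entries for this
    head when the query direction has barely changed since the last step."""
    return cosine_similarity(prev_query, curr_query) >= threshold


# Toy queries for a single head over three decoding steps.
head_queries = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
sims = adjacent_similarities(head_queries)
decisions = [s >= 0.9 for s in sims]  # 0.9 is an illustrative threshold
print(sims, decisions)
```

In this toy run the first pair of queries points in nearly the same direction, so the head's cache would be reused, while the second pair is far apart and would trigger a refresh. A per-head threshold, rather than a single global one, is what the TL;DR's "head-importance aware adaptive thresholds" suggests, though the exact rule is not given in this excerpt.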