Efficient Remote Prefix Fetching with GPU-native Media ASICs

Liang Mi; Weijun Wang; Jinghan Chen; Ting Cao; Haipeng Dai; Yunxin Liu

TL;DR

The paper tackles the latency bottleneck of remote KV cache reuse in large language model (LLM) serving under bandwidth constraints. It introduces KVFetcher, which pairs a codec-friendly, two-stage KV layout, compressed by a GPU-native video codec pipeline, with an efficient, non-blocking remote fetcher that uses adaptive-resolution streaming and frame-wise restoration. Key contributions include (1) a codec-aware inter- and intra-frame KV layout that achieves lossless, high-ratio compression; (2) a fetching-aware scheduler, adaptive-resolution KV fetching, and frame-wise KV restoration that together hide network and decoding latency; and (3) multi-model, multi-GPU evaluations showing time-to-first-token (TTFT) reductions of up to $3.51\times$ across 1–40 Gbps networks while maintaining accuracy. The approach enables scalable, widely deployable remote KV cache reuse for long-context LLMs, reducing serving cost and latency without requiring new hardware.

Abstract

Remote KV cache reuse fetches the KV cache for identical contexts from remote storage, avoiding recomputation and accelerating LLM inference. While it excels on high-speed networks, its performance degrades significantly in bandwidth-limited scenarios. Recent studies address this by transmitting KV caches in compressed form, but the heavyweight decompression this requires offsets the benefits of KV reuse. In this paper, we propose an efficient and widely deployable remote KV cache reuse solution that leverages GPU-native video codecs. Our system, KVFetcher, enables effective KV cache coding with two techniques. The codec-friendly tensor layout compresses the KV cache into a highly compact video format, enabling fast transmission. The efficient KV fetcher orchestrates the transmission, decoding, and restoration of compressed KV caches in a pipelined manner, eliminating resource contention, masking network fluctuations, and achieving minimum time-to-first-token (TTFT). We prototype KVFetcher on diverse GPUs, from high-end to low-end. Experiments show that it reduces TTFT by up to $3.51\times$ compared to SOTA methods while maintaining lossless accuracy.
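To illustrate why the abstract emphasizes a pipelined fetcher, the following is a minimal sketch, not KVFetcher's actual scheduler. It assumes the compressed KV cache is split into equal chunks and that each chunk passes through three stages (transmission, decoding, restoration) with fixed per-chunk times. The function names and the example stage times are hypothetical.

```python
def sequential_latency(stage_times, n_chunks):
    """Total time when each chunk finishes all stages before the next starts."""
    return n_chunks * sum(stage_times)


def pipelined_latency(stage_times, n_chunks):
    """Total time when stages overlap across chunks (one chunk per stage at a time)."""
    # finish[s] holds the time at which stage s finished its most recent chunk.
    finish = [0.0] * len(stage_times)
    for _ in range(n_chunks):
        prev_stage_done = 0.0
        for s, t in enumerate(stage_times):
            # A chunk enters stage s once it has left stage s-1 and stage s is free.
            start = max(prev_stage_done, finish[s])
            finish[s] = start + t
            prev_stage_done = finish[s]
    return finish[-1]


if __name__ == "__main__":
    # Hypothetical per-chunk times in ms: transmission, decoding, restoration.
    stages = (4.0, 2.0, 1.0)
    print("sequential:", sequential_latency(stages, 8))
    print("pipelined: ", pipelined_latency(stages, 8))
```

Under these assumptions the pipelined total is governed by the slowest stage, so decoding and restoration are largely hidden behind transmission whenever the network is the bottleneck.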

Paper Structure

This paper contains 26 sections, 1 equation, 28 figures, 3 tables, 1 algorithm.

Introduction
Motivation and Challenges
Preliminary of Remote KV Cache Reuse
Limitations of Existing Remote KV Cache Reuse Systems
Opportunity of GPU-native Video Codec
Challenges of Compressed KV Streaming with GPU-native Video Codec
KVFetcher Design
System Overview
Codec-friendly KV Compression
Inter-frame layout.
Intra-frame layout.
Efficient Remote KV Fetching
Fetching-aware scheduler.
Efficient KV Decompression.
Implementation
...and 11 more sections

Figures (28)

Figure 1: Solutions of remote KV cache reuse. Our KVFetcher exploits GPU-native video codecs, delivering the best cost-efficiency and system performance.
Figure 2: Current three prefilling types: full prefill, raw KV reuse, compressed KV reuse. Their time costs are: prefill, transmission+prefill, transmission+decompression+prefill.
Figure 3: "Winning areas" of three prefilling types under various bandwidths and context lengths. KVFetcher significantly extends the applicable scope of compressed KV reuse.
Figure 4: Concurrent LLM inference and KV decompression cause extra delay.
Figure 5: Kernel switch yields SM underutilization and memory I/O contention.
...and 23 more figures
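The cost breakdown in the Figure 2 caption (prefill; transmission + prefill; transmission + decompression + prefill) can be turned into a small comparison. This is a rough sketch under stated assumptions: transmission time is modeled simply as size divided by bandwidth, and every parameter name and number below is hypothetical rather than taken from the paper.

```python
def ttft_full_prefill(prefill_s):
    """Full prefill: recompute the whole context locally."""
    return prefill_s


def ttft_raw_reuse(kv_bytes, bandwidth_bps, remaining_prefill_s):
    """Raw KV reuse: transmission + prefill of whatever is not cached."""
    return kv_bytes * 8 / bandwidth_bps + remaining_prefill_s


def ttft_compressed_reuse(kv_bytes, ratio, bandwidth_bps,
                          decompress_s, remaining_prefill_s):
    """Compressed KV reuse: transmission + decompression + prefill."""
    return (kv_bytes / ratio) * 8 / bandwidth_bps + decompress_s + remaining_prefill_s


def best_option(kv_bytes, ratio, bandwidth_bps, full_prefill_s,
                decompress_s, remaining_prefill_s):
    """Return the name of the prefilling type with the lowest modeled TTFT."""
    costs = {
        "full_prefill": ttft_full_prefill(full_prefill_s),
        "raw_reuse": ttft_raw_reuse(kv_bytes, bandwidth_bps, remaining_prefill_s),
        "compressed_reuse": ttft_compressed_reuse(
            kv_bytes, ratio, bandwidth_bps, decompress_s, remaining_prefill_s),
    }
    return min(costs, key=costs.get)


if __name__ == "__main__":
    # Hypothetical workload: 4 GB KV cache, 8x compression, 0.5 s decompression.
    for gbps in (1, 10, 400):
        print(gbps, "Gbps ->", best_option(4e9, 8, gbps * 1e9, 2.0, 0.5, 0.1))
```

In this toy model, lowering the decompression time widens the range of bandwidths where compressed reuse has the lowest modeled TTFT, which is the kind of "winning area" shift the Figure 3 caption describes.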
