Efficient Remote Prefix Fetching with GPU-native Media ASICs
Liang Mi, Weijun Wang, Jinghan Chen, Ting Cao, Haipeng Dai, Yunxin Liu
TL;DR
The paper tackles the latency bottleneck of remote KV cache reuse in large language model serving under bandwidth constraints. It introduces KVFetcher, which pairs a codec-friendly, two-stage KV layout, compressed through a GPU-native video codec pipeline, with an efficient, non-blocking remote fetcher that uses adaptive-resolution streaming and frame-wise restoration. Key contributions include (1) a codec-aware inter- and intra-frame KV layout that achieves lossless, high-ratio compression, (2) a fetching-aware scheduler, adaptive-resolution KV fetching, and frame-wise KV restoration to hide network and decoding latency, and (3) multi-model, multi-GPU evaluations showing time-to-first-token (TTFT) reductions of up to $3.51\times$ across 1–40 Gbps networks while maintaining accuracy. The approach enables scalable, widely deployable remote KV cache reuse for long-context LLMs, reducing serving cost and latency without requiring new hardware.
Abstract
Remote KV cache reuse fetches the KV cache for identical contexts from remote storage, avoiding recomputation and accelerating LLM inference. While it excels in high-speed networks, its performance degrades significantly in bandwidth-limited scenarios. Recent studies address this by transmitting KV caches in compressed form, but the associated heavyweight decompression counteracts the benefits of KV reuse. In this paper, we propose an efficient and widely deployable remote KV cache reuse solution that leverages GPU-native video codecs. Our system, KVFetcher, enables effective KV cache coding with two techniques. First, the codec-friendly tensor layout compresses the KV cache into a highly compact video format, enabling fast transmission. Second, the efficient KV fetcher orchestrates the transmission, decoding, and restoration of compressed KV caches in a pipelined manner, eliminating resource contention, masking network fluctuations, and minimizing time-to-first-token (TTFT). We prototype KVFetcher on diverse GPUs, from high-end to low-end. Experiments show that, compared to SOTA methods, it reduces TTFT by up to 3.51 times without loss of accuracy.
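To illustrate why pipelining transmission, decoding, and restoration can hide latency, the following minimal sketch compares a strictly sequential fetch with a three-stage pipeline over a stream of KV chunks. It is a generic timing model, not KVFetcher's scheduler: the function names, the chunk count, and the per-stage times are hypothetical values chosen only for illustration.

```python
def sequential_time(chunks):
    """Total time when each chunk finishes every stage before the next chunk starts."""
    return sum(sum(stage_times) for stage_times in chunks)


def pipelined_time(chunks):
    """Makespan when each stage processes one chunk at a time, in order,
    and a chunk enters a stage as soon as both the chunk and the stage are ready."""
    n_stages = len(chunks[0])
    stage_free_at = [0.0] * n_stages  # time at which each stage becomes idle
    for stage_times in chunks:
        chunk_ready_at = 0.0  # time at which this chunk leaves its previous stage
        for s, t in enumerate(stage_times):
            start = max(chunk_ready_at, stage_free_at[s])
            chunk_ready_at = start + t
            stage_free_at[s] = chunk_ready_at
    return stage_free_at[-1]


# Hypothetical per-chunk times in ms: (transmit, decode, restore).
# These numbers are assumptions for illustration, not measurements from the paper.
chunks = [(40.0, 10.0, 5.0)] * 8

print("sequential:", sequential_time(chunks), "ms")
print("pipelined: ", pipelined_time(chunks), "ms")
```

In this toy setting the slowest stage (transmission) dominates the pipelined total, so decoding and restoration of earlier chunks overlap with the transfer of later ones and add only the tail latency of the final chunk.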
