Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference

Kexin Chu, Zecheng Lin, Dawei Xiang, Zixu Shen, Jianchang Su, Cheng Chu, Yiwei Yang, Wenhui Zhang, Wenfei Wu, Wei Zhang

TL;DR

This paper tackles the privacy risk introduced by global KV-cache sharing in multi-tenant LLM inference by proposing SafeKV, a system that co-designs privacy detection with cache management. It introduces a three-tier asynchronous detection pipeline, a unified radix-tree memory manager with path compression and progressive eviction, and a runtime safeguard guided by the Reuse Diversity Ratio (RDR) to bound leakage. In the evaluation, SafeKV reduces time-to-first-token (TTFT) overhead relative to full isolation by up to 40.58% and raises throughput by up to 2.66x while preserving most cache reuse benefits. The approach offers practical, scalable privacy for multi-tenant LLM serving while retaining most of the latency and reuse benefits of global sharing.

Abstract

Global KV-cache sharing is an effective optimization for accelerating large language model (LLM) inference, yet it introduces an API-visible timing side channel that lets adversaries infer sensitive user inputs from shared entries, leading to cross-tenant privacy risks. To address this problem, we introduce SafeKV (Secure and Flexible KV-cache Sharing), a system-level co-design of privacy enforcement and KV-cache management. SafeKV integrates lightweight detection and isolation directly into the serving runtime to eliminate cross-tenant reuse of sensitive KV-cache blocks under our threat model, while recovering most of the performance benefits of global sharing. Our key contributions are: (1) a three-tier asynchronous detection pipeline that decouples privacy classification from inference and supports streaming workloads, (2) a unified radix-tree-based memory manager with path compression and sensitivity-aware eviction for scalable selective isolation, and (3) an RDR-guided (Reuse Diversity Ratio) runtime safeguard that detects and bounds residual leakage. On large LLM backends, SafeKV reduces the time-to-first-token (TTFT) overhead compared to full isolation by up to 40.58% and raises throughput by up to 2.66x. Overall, SafeKV restores the efficiency of KV reuse while enforcing strong, practical privacy for multi-tenant LLM inference.
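The idea of selective sharing can be illustrated with a minimal sketch. It is not SafeKV's implementation: it uses a plain token-level trie rather than a compressed radix tree, omits eviction and the detection pipeline, and assumes the sensitivity flags are supplied by some external classifier. The names `SelectivePrefixCache`, `insert`, and `match_len` are hypothetical. Blocks at or after the first sensitive token are keyed by the owning tenant, so another tenant can never hit them and their presence cannot be timed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# A child is addressed by (token, scope): scope is None for globally shared
# blocks and the tenant id for private ones. This keying is an assumption
# made for illustration, not the layout used by SafeKV.
Key = Tuple[int, Optional[str]]


@dataclass
class Node:
    children: Dict[Key, "Node"] = field(default_factory=dict)


class SelectivePrefixCache:
    def __init__(self) -> None:
        self.root = Node()

    def insert(self, tenant: str, tokens: List[int], sensitive: List[bool]) -> None:
        """Cache a prompt; everything from the first sensitive token on is private."""
        node, private = self.root, False
        for tok, flag in zip(tokens, sensitive):
            # KV entries depend on all earlier tokens, so privacy is sticky.
            private = private or flag
            key = (tok, tenant if private else None)
            node = node.children.setdefault(key, Node())

    def match_len(self, tenant: str, tokens: List[int]) -> int:
        """Number of leading tokens whose KV entries this tenant may reuse."""
        node, hits = self.root, 0
        for tok in tokens:
            nxt = node.children.get((tok, None))
            if nxt is None:
                nxt = node.children.get((tok, tenant))
            if nxt is None:
                break
            node, hits = nxt, hits + 1
        return hits


cache = SelectivePrefixCache()
# Tokens 1 and 2 form a shared prefix; token 3 is flagged as sensitive.
cache.insert("alice", [1, 2, 3, 4], [False, False, True, False])
print(cache.match_len("alice", [1, 2, 3, 4]))  # 4: the owner reuses everything
print(cache.match_len("bob", [1, 2, 3, 4]))    # 2: only the non-sensitive prefix
```

In this sketch a probing tenant observes the same hit length whether or not another tenant has cached the sensitive continuation, which is the property that removes the timing signal for flagged blocks.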

Paper Structure

This paper contains 48 sections, 3 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Attack Overview.
  • Figure 2: Normalized performance of TTFT between SGLang (global-sharing) and Cache-Partition (isolated-per-user).
  • Figure 3: Comparison of self-attention computation mechanisms. The traditional approach (upper) performs full recomputation for each token, while the KV-cache (lower) reuses stored key-value vectors to accelerate inference. The KV-cache reduces the computational complexity per decoding step from $O(n^2)$ to $O(n)$.
  • Figure 4: The Architecture Overview of SafeKV.
  • Figure 5: Latency of lightweight general LLM models for PII detection.
  • ...and 10 more figures
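The per-step complexity stated in the Figure 3 caption can be checked with a small sketch. It is an illustration under simplifying assumptions, not code from the paper: a single attention head, identity projections (each token vector serves as its own query, key, and value), and a cost model that counts only query-key dot products. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Vec = List[float]


def attend(q: Vec, keys: List[Vec], values: List[Vec]) -> Tuple[Vec, int]:
    """Softmax attention of one query over all keys; returns (output, dot products)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, len(keys)


def step_without_cache(xs: List[Vec]) -> Tuple[Vec, int]:
    """Recompute causal attention at every position, keeping the last output."""
    cost, out = 0, []
    for i in range(len(xs)):
        out, c = attend(xs[i], xs[: i + 1], xs[: i + 1])
        cost += c
    return out, cost  # cost = n(n+1)/2, i.e. O(n^2)


def step_with_cache(xs: List[Vec]) -> Tuple[Vec, int]:
    """Earlier keys/values are already stored; only the newest query is evaluated."""
    return attend(xs[-1], xs, xs)  # cost = n, i.e. O(n)


tokens = [[0.1 * i, 1.0 - 0.05 * i] for i in range(8)]
full_out, full_cost = step_without_cache(tokens)
cached_out, cached_cost = step_with_cache(tokens)
print(full_cost, cached_cost)  # 36 8
```

Both paths produce the same output for the newest token; the cache only removes the repeated work on earlier positions.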