CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems

Panagiotis Georgios Pennas, Konstantinos Papaioannou, Marco Guarnieri, Thaleia Dimitra Doudali

TL;DR

This paper presents CacheSolidarity, a system that secures multi-tenant LLM serving systems against APC side channels without sacrificing performance and efficiency. It demonstrates that security in LLM serving does not have to come at the cost of unnecessarily reduced performance or unbearable overheads.

Abstract

Large Language Models (LLMs) rely on optimizations like Automatic Prefix Caching (APC) to accelerate inference. APC works by reusing previously computed states for the beginning part of a request (the prefix) when another request starts with the same text. While APC improves throughput, it introduces timing side channels: cache hits are faster than misses, creating observable latency differences. In multi-tenant systems, attackers can exploit these differences to infer sensitive information, e.g., by incrementally reconstructing another user's request from observed hit/miss patterns. Current defenses take a sledgehammer approach: they disable APC and cache sharing, isolating users and sacrificing efficiency for regular users. This paper presents CacheSolidarity, a system that secures multi-tenant LLM serving systems against APC side channels without sacrificing performance and efficiency. CacheSolidarity monitors cache reuse across users, flags suspicious sharing, and selectively isolates prefixes, restricting their reuse only when necessary. Evaluation shows that CacheSolidarity enables up to 70% higher cache reuse and 30% lower inference latency compared to existing defenses that isolate users. CacheSolidarity's lightweight design demonstrates that security in LLM serving does not have to come at the cost of unnecessarily reduced performance or unbearable overheads.
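To make the hit/miss signal described in the abstract concrete, the following is a minimal Python sketch of a toy prefix cache. It is not the paper's system or any real serving engine: the class `ToyPrefixCache`, the block size `BLOCK`, and the example prompts are all hypothetical, and the number of tokens left to compute stands in for time-to-first-token. The sketch only illustrates why a probe that shares a longer prefix with a victim's request is observably cheaper.

```python
from __future__ import annotations

BLOCK = 4  # tokens per cache block (illustrative value, an assumption)


class ToyPrefixCache:
    """Toy model of prefix caching over block-aligned token prefixes."""

    def __init__(self) -> None:
        # Every cached entry is a full block-aligned prefix, stored as a tuple.
        self._seen: set[tuple[str, ...]] = set()

    def serve(self, tokens: list[str]) -> int:
        """Process a request and return how many tokens had to be computed.

        Fewer computed tokens is used here as a proxy for a faster response.
        """
        hit_tokens = 0
        # Walk the request block by block; reuse stops at the first miss.
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if tuple(tokens[:end]) in self._seen:
                hit_tokens = end
            else:
                break
        # After serving, every full-block prefix of this request is cached.
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self._seen.add(tuple(tokens[:end]))
        return len(tokens) - hit_tokens


# Hypothetical prompts: the victim's request populates the shared cache.
victim = "my account pin is 4 2 4 2".split()
guess_wrong = "my account pin is 9 9 9 9".split()
guess_right = "my account pin is 4 2 4 2".split()

cache = ToyPrefixCache()
cache.serve(victim)
work_wrong = cache.serve(guess_wrong)  # only the first block is reused
work_right = cache.serve(guess_right)  # both blocks are reused
print(work_wrong, work_right)
```

In this toy model the correct guess needs less computation than the incorrect one, which is the kind of difference an attacker would try to observe through latency. Isolating users removes the signal but also removes all reuse, which is the trade-off the abstract describes.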

Paper Structure

This paper contains 32 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Timing side-channel leakage in prefix-sharing LLM inference. The attacker sends crafted prompts and measures time-to-first-token (TTFT) to detect cache hits or misses caused by Automatic Prefix Caching (APC) and steal the sensitive information in the victim's prompt.
  • Figure 2: TTFT difference between cache hits (red) and misses (blue) for increasing lengths of prefixes/prompts reused across users. Examples are shown for different LLMs and system loads (requests per second, RPS).
  • Figure 3: Effect of the LLM, the prefix/prompt length, and the system load (requests per second) on the distinguishability of APC timing differences, measured as the kernel density estimate (KDE) overlap between cache hits and misses.
  • Figure 4: System Design of CacheSolidarity.
  • Figure 5: Example workflow of CacheSolidarity.
  • ...and 5 more figures
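The Figure 3 caption refers to a KDE overlap between cache hits and misses. As a rough illustration, the sketch below computes one common form of such a measure, the overlap coefficient: the integral of the pointwise minimum of two Gaussian kernel density estimates. The paper's exact procedure is not given in this listing, so the function names, the fixed bandwidth, the integration grid, and the sample TTFT values are all assumptions made for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a Gaussian kernel density estimate as a callable pdf."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return pdf


def kde_overlap(a, b, bandwidth, grid_points=2000):
    """Approximate the integral of min(f_a, f_b) with a midpoint sum.

    A value near 1 means the two samples are hard to tell apart;
    a value near 0 means they are easy to distinguish.
    """
    lo = min(min(a), min(b)) - 4.0 * bandwidth
    hi = max(max(a), max(b)) + 4.0 * bandwidth
    step = (hi - lo) / grid_points
    f, g = gaussian_kde(a, bandwidth), gaussian_kde(b, bandwidth)
    total = 0.0
    for i in range(grid_points):
        x = lo + (i + 0.5) * step
        total += min(f(x), g(x))
    return total * step


# Made-up TTFT samples in milliseconds, for illustration only.
hits = [10.0, 11.0, 12.0]
misses = [50.0, 51.0, 52.0]
separated = kde_overlap(hits, misses, bandwidth=1.0)
identical = kde_overlap(hits, hits, bandwidth=1.0)
print(separated, identical)
```

Under this definition, a low overlap corresponds to timing distributions that an attacker can separate reliably, and a high overlap to distributions where hits and misses look alike.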