The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems

Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, Rui Hou

TL;DR

The paper tackles the problem of privacy leakage in multi-tenant LLM serving systems caused by timing side channels introduced by shared KV and semantic caches. It develops two attack paradigms, Prompt Stealing Attacks (PSA) and Peeping Neighbor Attacks (PNA), and provides offline timing analyses, online classifiers, and token-by-token/greedy-search strategies to recover system and peer prompts. Through experiments on open-source and commodity LLMs, the authors demonstrate feasible prompt recovery, semantic attribute leakage, and cross-user document exposure, alongside a black-box measurement study across real providers. They propose mitigations including batching KV-cache sharing to at least $k$ tokens and anonymizing private attributes in semantic search, with modest performance overhead and improved privacy. The work highlights practical privacy risks in current LLM-serving infrastructures and calls for security-conscious design in caching and scheduling to protect user data at scale.
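The chunked-sharing mitigation mentioned above can be illustrated with a small sketch. It assumes one possible reading of the policy, namely that a cached prefix is reused only in whole chunks of $k$ tokens; the paper's actual mechanism may differ. The function names `common_prefix_len` and `reusable_prefix_len` and the token IDs are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common prefix of two token-ID sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_prefix_len(cached: Sequence[int], request: Sequence[int], k: int) -> int:
    """Tokens served from the shared cache when reuse is limited to
    whole k-token chunks (an assumed policy, not the paper's code)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return (common_prefix_len(cached, request) // k) * k


# Hypothetical token IDs: the request agrees with the cached prompt on 5 tokens.
cached = [11, 12, 13, 14, 15, 16, 17, 18]
request = [11, 12, 13, 14, 15, 99, 99, 99]

for k in (1, 4, 8):
    print(f"k={k}: {reusable_prefix_len(cached, request, k)} tokens reused")
```

With `k = 1`, each additional correct token changes the amount of reused cache, so a timing observer can confirm guesses one token at a time. With larger `k`, the reused length only changes when an entire chunk matches, so under this reading a guess must get all `k` tokens of a chunk right before any timing difference appears.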

Abstract

The wide deployment of Large Language Models (LLMs) has given rise to strong demands for optimizing their inference performance. Today's techniques serving this purpose primarily focus on reducing latency and improving throughput through algorithmic and hardware enhancements, while largely overlooking their privacy side effects, particularly in a multi-user environment. In our research, for the first time, we discovered a set of new timing side channels in LLM systems, arising from shared caches and GPU memory allocations, which can be exploited to infer both confidential system prompts and those issued by other users. These vulnerabilities echo security challenges observed in traditional computing systems, highlighting an urgent need to address potential information leakage in LLM serving infrastructures. In this paper, we report novel attack strategies designed to exploit such timing side channels inherent in LLM deployments, specifically targeting the Key-Value (KV) cache and semantic cache widely used to enhance LLM inference performance. Our approach leverages timing measurements and classification models to detect cache hits, allowing an adversary to infer private prompts with high accuracy. We also propose a token-by-token search algorithm to efficiently recover shared prompt prefixes in the caches, showing the feasibility of stealing system prompts and those produced by peer users. Our experimental studies on black-box testing of popular online LLM services demonstrate that such privacy risks are completely realistic, with significant consequences. Our findings underscore the need for robust mitigation to protect LLM systems against such emerging threats.
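To make the idea of detecting cache hits from timing concrete, here is a minimal sketch that runs entirely on synthetic data. It is a simple stand-in for the classification models the abstract mentions, not the authors' classifier: the latency values (means of 20 ms and 30 ms, standard deviation of 2 ms) are invented for illustration, and `fit_threshold` and `is_cache_hit` are hypothetical names.

```python
import random
from statistics import mean


def fit_threshold(hit_ms, miss_ms):
    """Midpoint between the mean hit latency and the mean miss latency."""
    return (mean(hit_ms) + mean(miss_ms)) / 2.0


def is_cache_hit(latency_ms, threshold_ms):
    """Label a measurement as a hit when it is faster than the threshold."""
    return latency_ms < threshold_ms


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic latencies in milliseconds (assumed values, not measured data).
hit_train = [rng.gauss(20.0, 2.0) for _ in range(200)]
miss_train = [rng.gauss(30.0, 2.0) for _ in range(200)]
threshold = fit_threshold(hit_train, miss_train)

# Evaluate on fresh synthetic samples drawn from the same distributions.
hit_test = [rng.gauss(20.0, 2.0) for _ in range(200)]
miss_test = [rng.gauss(30.0, 2.0) for _ in range(200)]
correct = sum(is_cache_hit(x, threshold) for x in hit_test)
correct += sum(not is_cache_hit(x, threshold) for x in miss_test)
accuracy = correct / (len(hit_test) + len(miss_test))

print(f"threshold = {threshold:.2f} ms, accuracy = {accuracy:.3f}")
```

A single threshold works here only because the two synthetic distributions are well separated. Real measurements include network jitter and scheduling noise, which is presumably why the paper pairs offline timing analysis with trained classifiers rather than a fixed cutoff.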

Paper Structure

This paper contains 19 sections, 2 equations, 12 figures, and 6 tables.

Figures (12)

  • Figure 1: LLM API servers like OpenAI allow user input through both direct requests (top) and synthesized requests via a template (bottom).
  • Figure 2: Overview of prompt stealing attacks.
  • Figure 3: Latency distribution of one token hit and miss.
  • Figure 4: UMAP projection demonstrating the dataset's heterogeneity: As sampling density increases, the visualization reveals expanded dimensional ranges with distinct semantic clusters and isolated points, highlighting the complex multidimensional structure of the prompt dataset.
  • Figure 5: Code for measuring response latency in PSA.
  • ...and 7 more figures