InputSnatch: Stealing Input in LLM Services via Timing Side-Channel Attacks

Xinyao Zheng, Husheng Han, Shangyi Shi, Qiyan Fang, Zidong Du, Xing Hu, Qi Guo

TL;DR

InputSnatch demonstrates that common cache-sharing optimizations in LLM inference create practical timing side channels capable of reconstructing private user prompts. The authors develop a two-component attack framework, consisting of an Input Constructor and a Time Analyzer, that exploits prefix and semantic caching to recover inputs with high accuracy under realistic constraints. Experiments on vLLM and GPTCache show robust timing patterns that enable field-level recovery and semantic-content leakage across medical and legal domains, underscoring privacy risks in cloud-based inference. The work highlights a critical trade-off between performance optimization and privacy, and proposes defenses such as per-user cache isolation, rate limiting, and timing obfuscation to mitigate these vulnerabilities. Overall, the paper analyzes cache-based timing leaks in LLM services and provides concrete guidance for securing production deployments against such side-channel threats.

Abstract

Large language models (LLMs) possess extensive knowledge and question-answering capabilities, and have been widely deployed in privacy-sensitive domains like finance and medical consultation. During LLM inference, cache-sharing methods are commonly employed to enhance efficiency by reusing cached states or responses for the same or similar inference requests. However, we identify that these cache mechanisms pose a risk of private input leakage, as caching can result in observable variations in response times, making these variations a strong candidate signal for a timing-based attack. In this study, we propose a novel timing-based side-channel attack to execute input theft in LLM inference. The cache-based attack faces the challenge of constructing candidate inputs in a large search space to hit and steal cached user queries. To address this challenge, we propose two primary components. The input constructor employs machine learning techniques and LLM-based approaches for vocabulary correlation learning while implementing optimized search mechanisms for generalized input construction. The time analyzer implements statistical time fitting with outlier elimination to identify cache hit patterns, continuously providing feedback to refine the constructor's search strategy. We conduct experiments across two cache mechanisms, and the results demonstrate that our approach consistently attains high attack success rates in various applications. Our work highlights the security vulnerabilities associated with performance optimizations, underscoring the necessity of prioritizing privacy and security alongside enhancements in LLM inference.
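
The abstract does not spell out the time analyzer's statistics, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the attacker has collected latency samples for known hits and known misses, removes outliers with a median-absolute-deviation rule, and classifies a probe by comparing its cleaned median to the midpoint of the two baselines. The function names, the MAD rule, and the midpoint threshold are all assumptions made for illustration.

```python
import statistics

def remove_outliers(samples, k=3.0):
    """Drop samples farther than k scaled MADs from the median (assumed rule)."""
    samples = list(samples)
    med = statistics.median(samples)
    mad = statistics.median([abs(s - med) for s in samples])
    if mad == 0:
        # All samples are (nearly) identical; nothing to filter.
        return samples
    # 1.4826 scales the MAD to be comparable to a standard deviation.
    return [s for s in samples if abs(s - med) <= k * 1.4826 * mad]

def fit_threshold(hit_times, miss_times):
    """Midpoint between the cleaned hit median and the cleaned miss median."""
    hit = statistics.median(remove_outliers(hit_times))
    miss = statistics.median(remove_outliers(miss_times))
    return (hit + miss) / 2

def is_cache_hit(probe_times, threshold):
    """A probe counts as a hit if its cleaned median latency is below the threshold."""
    return statistics.median(remove_outliers(probe_times)) < threshold

# Hypothetical latencies in seconds; each list contains one network-jitter outlier.
hit_baseline = [0.020, 0.021, 0.019, 0.022, 0.300]
miss_baseline = [0.050, 0.052, 0.049, 0.051, 0.400]
threshold = fit_threshold(hit_baseline, miss_baseline)

fast_probe = is_cache_hit([0.021, 0.020, 0.250], threshold)
slow_probe = is_cache_hit([0.050, 0.051, 0.049], threshold)
```

Repeating each probe several times and taking a robust statistic is what makes a single slow response (for example, from network jitter) unlikely to flip the decision. In the attack described above, the resulting hit or miss signal would be fed back to the input constructor.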

Paper Structure

This paper contains 25 sections, 1 equation, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Prefill time difference between cache hits and misses for varying input lengths, using OpenAI API calls to the GPT-4o-mini LLM. (a) Prefix caching implemented by OpenAI. (b) Semantic caching with GPTCache.
  • Figure 2: Comparison of self-attention computation mechanisms. The traditional approach (upper) performs full recomputation for each token, while the KV cache (lower) reuses stored key-value vectors to accelerate inference. The KV Cache reduces the computational complexity per decoding step from $O(n^2)$ to $O(n)$.
  • Figure 3: Overview of the RAG-assisted LLM system with the semantic caching mechanism. User queries are first matched against cached requests based on semantic similarity. Responses are retrieved directly from the cache if the similarity score exceeds the threshold; otherwise, the system proceeds with vector database retrieval and LLM inference.
  • Figure 4: Time difference between hits and misses for prefix caching: 100 experiments in vLLM by a local API deployment using the LLaMa-2 70B model. (a) Time for input with varying lengths taken to generate one token. (b) Time for input with the same length to generate different numbers of tokens.
  • Figure 5: Time difference between hits and misses for semantic caching: conducted in GPTCache by invoking API to access GPT-4o-mini. (a) Time for different inputs with varying lengths to generate one token. (b) Time for input with varying lengths to generate complete responses.
  • ...and 6 more figures
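
The semantic-cache lookup described in the Figure 3 caption can be illustrated with a small sketch. It is not GPTCache's implementation: the `SemanticCache` class, the bag-of-words `embed` function (a stand-in for a real embedding model), and the 0.8 threshold are all hypothetical choices made to show the threshold test on a similarity score.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would use an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Hypothetical cache: return a stored response when similarity >= threshold."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (query vector, response) pairs

    def put(self, query, response):
        self.entries.append((embed(query), response))

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit: skips retrieval and LLM inference
        return None         # cache miss: caller falls through to the slow path

cache = SemanticCache()
cache.put("what are the symptoms of diabetes", "cached answer")
similar = cache.get("what are symptoms of diabetes")
unrelated = cache.get("how do I file a tax return")
```

Because a hit skips retrieval and inference entirely, the response to a query that is merely similar to another user's cached query comes back much faster, which is the timing difference the attack observes.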