When Speculation Spills Secrets: Side Channels via Speculative Decoding In LLMs

Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar

TL;DR

This work reveals privacy risks in speculative decoding for LLMs: input-dependent speculation patterns produce side-channel signals, observable through per-iteration token counts and packet sizes, that can reveal user prompts and datastore contents. It evaluates four speculative-decoding schemes (LADE, REST, BiLD, EAGLE) and demonstrates high query-fingerprinting accuracy and datastore leakage, including in remote vLLM scenarios and under distribution shift. The authors propose practical mitigations, notably token aggregation and packet padding, and quantify their effectiveness and trade-offs, including substantial payload overhead for fixed-padding defenses. The findings underscore the need for privacy-preserving deployment of speculative decoding in production systems and offer concrete defenses that mitigate information leakage without compromising throughput.

Abstract

Deployed large language models (LLMs) often rely on speculative decoding, a technique that generates and verifies multiple candidate tokens in parallel, to improve throughput and latency. In this work, we reveal a new side channel whereby input-dependent patterns of correct and incorrect speculations can be inferred by monitoring per-iteration token counts or packet sizes. In evaluations using research prototypes and production-grade vLLM serving frameworks, we show that an adversary monitoring these patterns can fingerprint user queries (from a set of 50 prompts) with over 75% accuracy across four speculative-decoding schemes at temperature 0.3: REST (100%), LADE (91.6%), BiLD (95.2%), and EAGLE (77.6%). Even at temperature 1.0, accuracy remains far above the 2% random baseline: REST (99.6%), LADE (61.2%), BiLD (63.6%), and EAGLE (24%). We also show that an attacker can leak confidential datastore contents used for prediction at rates exceeding 25 tokens/sec. To defend against these attacks, we propose and evaluate a suite of mitigations, including packet padding and iteration-wise token aggregation.
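As a rough illustration of the two mitigations named above, the following Python sketch shows how packet padding and token aggregation would coarsen what an observer sees. It is not the authors' implementation: the function names, the 64-byte block size, and the choice to aggregate over a fixed number of iterations are all assumptions made for the example.

```python
from typing import List


def padded_size(size: int, block: int = 64) -> int:
    """Size on the wire after padding up to the next multiple of `block`.

    The 64-byte block is an assumed value, not one taken from the paper.
    """
    return -(-size // block) * block  # ceiling division, then scale back up


def aggregate_tokens(tokens_per_iter: List[int], k: int) -> List[int]:
    """Release tokens only once every `k` iterations (hypothetical scheme).

    The observer then sees one count per group of k iterations instead of
    one count per iteration.
    """
    return [sum(tokens_per_iter[i:i + k])
            for i in range(0, len(tokens_per_iter), k)]


# A per-iteration trace: 1 = mis-speculation, >1 = accepted speculation.
trace = [1, 3, 1, 1, 4, 1]
coarse_trace = aggregate_tokens(trace, k=3)
wire_sizes = [padded_size(n) for n in (1, 64, 65)]
```

With padding, payloads of 1 and 64 bytes become indistinguishable in size, at the cost of extra bytes on the wire; with aggregation, the six per-iteration counts collapse into two sums, which hides the position of each mis-speculation within a group.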

Paper Structure

This paper contains 38 sections, 13 figures, 5 tables.

Figures (13)

  • Figure 1: (a) In an LLM response with Speculative Decoding (e.g., LADE), tokens are either correctly speculated (blue) and verified in parallel, or mis-speculated (red) and generated via auto-regressive decoding. (b) This pattern can be inferred by measuring the number of tokens generated per iteration: multiple tokens in an iteration indicate correct speculation, and a single token indicates mis-speculation. (c) Using these patterns, a network-based adversary can fingerprint user queries and learn private user prompts and responses; a malicious user can observe correct predictions and leak the datastores and hyper-parameters used for predictions.
  • Figure 2: (a) In the offline phase, the network-based attacker profiles the variation in the number of tokens per iteration caused by speculative decoding (inferred from packet sizes) and trains a classifier. In the online phase, the attacker uses the classifier to leak the input. (b) Packet sizes and tokens per iteration are correlated, allowing variations in packet sizes to be used to approximate token count variations.
  • Figure 3: Accuracy of the query fingerprinting attack on LADE, REST, and BiLD with temperature of 0.8, using 5 to 30 Traces Per Query (TPQ) for training.
  • Figure 4: Attack accuracy of query fingerprinting as temperatures vary (0.3, 0.6, 0.8, 1.0), using 30 Traces Per Query (TPQ) for training.
  • Figure 5: Analysis of datastore leakage from REST.
  • ...and 8 more figures
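
The inference in Figure 1(b) and the offline/online phases in Figure 2(a) can be sketched in a few lines of Python. This is a toy under stated assumptions: the captions do not say which classifier the authors train, so a nearest-neighbour match with an L1 distance stands in for it, and all names and example traces here are hypothetical.

```python
from typing import Dict, List


def speculation_pattern(tokens_per_iter: List[int]) -> List[bool]:
    """True where an iteration emitted more than one token (correct speculation)."""
    return [n > 1 for n in tokens_per_iter]


def trace_distance(a: List[int], b: List[int]) -> int:
    """L1 distance between two traces, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a_pad = a + [0] * (n - len(a))
    b_pad = b + [0] * (n - len(b))
    return sum(abs(x - y) for x, y in zip(a_pad, b_pad))


def fingerprint(trace: List[int], profiles: Dict[str, List[List[int]]]) -> str:
    """Return the profiled query whose closest stored trace best matches `trace`.

    `profiles` maps a query label to the traces collected for it offline.
    Nearest-neighbour matching is an assumed stand-in for the paper's classifier.
    """
    return min(
        profiles,
        key=lambda q: min(trace_distance(trace, t) for t in profiles[q]),
    )


# Offline phase: traces per query (made-up values for illustration).
profiles = {"q1": [[1, 3, 1, 4]], "q2": [[1, 1, 1, 1, 2]]}
# Online phase: match a freshly observed trace against the profiles.
guess = fingerprint([1, 3, 1, 3], profiles)
```

In a network setting the observed trace would be approximated from packet sizes rather than read directly, as Figure 2(b) describes; the sketch skips that step and starts from token counts.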