When Speculation Spills Secrets: Side Channels via Speculative Decoding In LLMs
Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar
TL;DR
This work reveals privacy risks in speculative decoding for LLMs: input-dependent speculation patterns produce side-channel signals, observable as per-iteration token counts and packet sizes, that can reveal user prompts and datastore contents. The authors evaluate four speculative decoding schemes (LADE, REST, BiLD, EAGLE) and demonstrate high query-fingerprinting accuracy and datastore leakage, including in remote vLLM scenarios, with the attack remaining robust under distribution shift. They propose practical mitigations, notably token aggregation and packet padding, and quantify their effectiveness and trade-offs, including the substantial payload overhead of fixed-padding defenses. The findings underscore the need for privacy-preserving deployment of speculative decoding in production systems and offer concrete defenses to mitigate information leakage without compromising throughput.
Abstract
Deployed large language models (LLMs) often rely on speculative decoding, a technique that generates and verifies multiple candidate tokens in parallel, to improve throughput and latency. In this work, we reveal a new side channel whereby input-dependent patterns of correct and incorrect speculations can be inferred by monitoring per-iteration token counts or packet sizes. In evaluations using research prototypes and production-grade vLLM serving frameworks, we show that an adversary monitoring these patterns can fingerprint user queries (from a set of 50 prompts) with over 75% accuracy across four speculative-decoding schemes at temperature 0.3: REST (100%), LADE (91.6%), BiLD (95.2%), and EAGLE (77.6%). Even at temperature 1.0, accuracy remains far above the 2% random baseline: REST (99.6%), LADE (61.2%), BiLD (63.6%), and EAGLE (24%). We also show that an attacker can leak confidential datastore contents used for prediction at rates exceeding 25 tokens/sec. To defend against these attacks, we propose and evaluate a suite of mitigations, including packet padding and iteration-wise token aggregation.
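To make the leakage concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the adversary has already recovered a trace of per-iteration token counts (for example, from packet sizes) and holds one reference trace per candidate prompt. The nearest-neighbor matcher is a simple stand-in for whatever classifier an attacker might use, and all names here (distance, fingerprint, aggregate, pad_packet, the example traces, the 256-byte bucket) are hypothetical. The last two helpers show one plausible reading of the two mitigations named above: summing token counts over k iterations before release, and rounding packet sizes up to a fixed bucket.

```python
from typing import Dict, List, Sequence

Trace = List[int]  # tokens emitted in each decoding iteration (assumed observable)


def distance(a: Sequence[int], b: Sequence[int]) -> int:
    """L1 distance between two traces; the shorter one is zero-padded."""
    n = max(len(a), len(b))
    a_pad = list(a) + [0] * (n - len(a))
    b_pad = list(b) + [0] * (n - len(b))
    return sum(abs(x - y) for x, y in zip(a_pad, b_pad))


def fingerprint(observed: Sequence[int], profiles: Dict[str, Trace]) -> str:
    """Return the candidate prompt whose reference trace is closest to the observation."""
    return min(profiles, key=lambda name: distance(observed, profiles[name]))


def aggregate(trace: Sequence[int], k: int) -> Trace:
    """Hypothetical token aggregation: release tokens once every k iterations."""
    return [sum(trace[i:i + k]) for i in range(0, len(trace), k)]


def pad_packet(payload_bytes: int, bucket: int) -> int:
    """Hypothetical packet padding: round the size up to a multiple of `bucket`."""
    return -(-payload_bytes // bucket) * bucket


# Made-up reference traces for two candidate prompts.
profiles = {
    "prompt_a": [1, 3, 1, 1, 4, 2],
    "prompt_b": [2, 1, 2, 3, 1, 3],
}
observed = [1, 3, 1, 2, 4, 2]  # noisy observation of prompt_a

guess = fingerprint(observed, profiles)

# After aggregating over k = 3 iterations, the two example traces coincide,
# so this particular pair can no longer be told apart from counts alone.
agg_a = aggregate(profiles["prompt_a"], 3)
agg_b = aggregate(profiles["prompt_b"], 3)

padded = pad_packet(130, 256)
```

With 50 equally likely candidate prompts, guessing uniformly at random succeeds with probability 1/50, which is the 2% baseline quoted above. In this toy example, aggregation hides the difference between the two traces only because they were chosen to have equal sums over each window; real traces would generally retain some signal, which is why the effectiveness of such defenses has to be measured rather than assumed.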
