SPIRe: Boosting LLM Inference Throughput with Speculative Decoding
Sanjit Neelam, Daniel Heinlein, Vaclav Cvicek, Akshay Mishra, Reiner Pope
TL;DR
This work targets the cost per token of autoregressive decoding by reframing speculative decoding as a throughput problem. It introduces SPIRe, a draft model that combines a Sliding window KV cache, Pruned Initialization, and a Feedback Transformer, together with a distillation objective, to maximize the throughput multiplier at large batch sizes and long contexts. Using an implementation-agnostic evaluation framework and targeted ablations, the authors show that SPIRe substantially improves modeled throughput over both standard speculative decoding and the MagicDec baseline, with a favorable cost-to-benefit trade-off. The evaluation methodology makes it practical to compare draft-model designs across batch sizes and context lengths, which matters for production LLM serving, and the analysis covers the trade-offs involved and extrapolation to larger models such as Llama 3 70B.
Abstract
Speculative decoding (SD) has been shown to reduce the latency of autoregressive decoding (AD) by 2-3x for small batch sizes. However, increasing throughput, and therefore reducing the cost per token, requires decoding with large batch sizes. Recent work shows that SD can also accelerate decoding with large batch sizes if the context is sufficiently long and the draft model's KV cache is sparse. We introduce SPIRe, a draft model that combines static sparse attention, pruned initialization, and feedback memory to increase the modeled throughput of speculative decoding by over 100% compared to speculation with a much smaller draft model and by over 35% compared to the strong baseline of sparse self-speculation. Our approach is particularly effective when context lengths vary significantly across requests.
