Accelerating Production LLMs with Combined Token/Embedding Speculators
Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, Mudhakar Srivatsa
TL;DR
The paper addresses the challenge of deploying large LLMs in production by introducing a speculator that performs speculative decoding conditioned on both the base model state and sampled tokens to draft multiple tokens per forward pass. It presents a multi-stage, multi-head speculator architecture and a two-stage training pipeline that aligns the speculator with base-model outputs, achieving 2–3x wall-clock speedups across Llama2-7B/13B, Codellama-13B-instruct, and Granite-20B in production-like settings. Key contributions include open-sourcing the code, demonstrating production-relevant speedups, and providing analysis of how performance scales with workload, prompt length, and parallelism, along with guidance for dynamic deployment strategies. The work offers practical pathways to accelerate production LLM inference while highlighting the need for adaptive strategies to preserve fidelity under high-load conditions.
Abstract
This technical report describes the design and training of novel speculative decoding draft models for accelerating large language model inference in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per forward pass, accelerating the wall-clock inference speed of highly optimized base model implementations by a factor of 2-3x. We explore these initial results and describe next steps for further improvements.
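
The accept-or-reject step described above can be illustrated with a minimal sketch. It assumes greedy decoding and a simple longest-matching-prefix rule, which is one common way to verify a draft; the abstract does not specify the exact acceptance rule used in this work. The function name `accept_draft` and its arguments are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def accept_draft(draft: Sequence[int], base_predictions: Sequence[int]) -> List[int]:
    """Verify a drafted n-gram against the base model (greedy sketch, an assumption).

    draft[i] is the i-th token proposed by the speculator.
    base_predictions[i] is the base model's greedy token for the position of
    draft[i], computed in one forward pass over the prefix plus the draft.
    It has one extra entry: the base model's prediction after the full draft.
    """
    if len(base_predictions) != len(draft) + 1:
        raise ValueError("base_predictions must have len(draft) + 1 entries")

    accepted: List[int] = []
    for proposed, verified in zip(draft, base_predictions):
        if proposed != verified:
            break  # first disagreement: reject this token and everything after it
        accepted.append(proposed)

    # The base model always contributes one token of its own: either the
    # correction at the first mismatch or a bonus token after a full match.
    accepted.append(base_predictions[len(accepted)])
    return accepted
```

Under this rule every base-model forward pass yields at least one token and at most `len(draft) + 1`, which is where the multiple-tokens-per-pass speedup comes from: the more of the draft the base model agrees with, the fewer forward passes are needed.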
