Accelerating Production LLMs with Combined Token/Embedding Speculators

Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, Mudhakar Srivatsa

TL;DR

The paper addresses the challenge of deploying large language models in production by introducing a speculator for speculative decoding that conditions on both the base model state and sampled tokens, so that multiple tokens can be drafted per base-model forward pass. It presents a multi-stage, multi-head speculator architecture and a two-stage training pipeline that aligns the speculator with base-model outputs, achieving 2–3x wall-clock speedups across Llama2-7B/13B, Codellama-13B-instruct, and Granite-20B in production-like settings. Key contributions include open-sourcing the code, demonstrating production-relevant speedups, and analyzing how performance scales with workload, prompt length, and parallelism, along with guidance for dynamic deployment strategies. The work offers practical pathways to accelerate production LLM inference while highlighting the need for adaptive strategies to sustain those gains under high-load conditions.

Abstract

This technical report describes the design and training of novel speculative decoding draft models for accelerating the inference speeds of large language models in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per inference forward pass, accelerating wall-clock inference speeds of highly optimized base model implementations by a factor of 2-3x. We explore these initial results and describe next steps for further improvements.
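
The accept-or-reject step mentioned in the abstract can be illustrated with a minimal sketch. It assumes greedy decoding and a single drafted n-gram; the names `verify_draft`, `base_next_token`, and `toy_base_model` are hypothetical and do not come from the paper's code. A real system would score every draft position in one batched forward pass of the base model, whereas this sketch uses a plain loop for readability.

```python
from typing import Callable, List, Sequence


def verify_draft(
    base_next_token: Callable[[Sequence[int]], int],
    prefix: Sequence[int],
    draft: Sequence[int],
) -> List[int]:
    """Return the tokens emitted by one greedy verification step.

    Drafted tokens are kept while they match what the base model would have
    produced. The first mismatch is replaced by the base model's own token
    and the rest of the draft is discarded.
    """
    context = list(prefix)
    emitted: List[int] = []
    for proposed in draft:
        expected = base_next_token(context)  # base model's greedy choice
        emitted.append(expected)
        context.append(expected)
        if proposed != expected:
            return emitted  # reject the remainder of the draft
    # Every drafted token was accepted, so the base model's prediction after
    # the last drafted token comes for free as one extra token.
    emitted.append(base_next_token(context))
    return emitted


def toy_base_model(context: Sequence[int]) -> int:
    """Hypothetical stand-in for a base model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


partly_right = verify_draft(toy_base_model, [1, 2, 3], [4, 5, 9])
all_right = verify_draft(toy_base_model, [1, 2, 3], [4, 5, 6])
```

In this toy setup `partly_right` is `[4, 5, 6]` (two drafted tokens accepted, then a correction) and `all_right` is `[4, 5, 6, 7]`. Each step emits at least one token, so under these assumptions the output matches plain greedy decoding while taking fewer base-model steps whenever the draft is good.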

Paper Structure (13 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: A simple architecture diagram for a 3-headed multi-stage MLP speculator. Z is the latest state vector from the base model, while T[n] is the sampled token at time $t+n$.
  • Figure 2: Per-head training loss curves for Llama2-13B speculator training, stages 1 (left) and 2 (right). Loss values jump downward at the start of stage 2 as the task becomes inherently easier: rather than using base model behavior to predict ground truth text, we are using base model behavior to predict other base model behavior (future tokens).
  • Figure 3: Throughput (x-axis) vs iterative token latency (y-axis) for Llama2-13B with speculative decoding on an inference server. Numbers indicate concurrent users. Top: artificial homogeneous workloads (49 tokens in, 100 tokens out). Bottom: heterogeneous workloads matched to historical logs. Speculators scale better in the heterogeneous case (though at lower throughput) because the number of concurrent users $b$ is only an upper bound on batch size; in practice the expected batch size is lower.
  • Figure 4: Throughput (x-axis) vs iterative token latency (y-axis) for Granite-20B with speculative decoding on an inference server. Numbers indicate concurrent users. Top: artificial homogeneous workloads (50 tokens in, 100 tokens out). Bottom: heterogeneous workloads matched to historical logs. Speculators scale better in the heterogeneous case (though at lower throughput) because the number of concurrent users $b$ is only an upper bound on batch size; in practice the expected batch size is lower.
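
The data flow described in the Figure 1 caption (a state vector Z plus the latest sampled token feeding a chain of heads) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the class `ToySpeculator`, the random weights, the tiny dimensions, the `tanh` activation, and the greedy argmax used in place of sampling are all hypothetical choices made to keep the example self-contained.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToySpeculator:
    """Hypothetical multi-stage speculator with one small MLP stage per head."""

    def __init__(self, n_heads: int = 3, dim: int = 4, vocab: int = 8, seed: int = 0):
        rng = random.Random(seed)
        # Each head owns a token embedding table, a state projection,
        # and an output projection onto the vocabulary.
        self.embed = [rand_matrix(vocab, dim, rng) for _ in range(n_heads)]
        self.state_proj = [rand_matrix(dim, dim, rng) for _ in range(n_heads)]
        self.out_proj = [rand_matrix(vocab, dim, rng) for _ in range(n_heads)]

    def draft(self, z: Vector, t0: int) -> List[int]:
        """Draft one token per head from base-model state z and token t0."""
        state, token, drafted = z, t0, []
        for emb, w_state, w_out in zip(self.embed, self.state_proj, self.out_proj):
            # Combine the previous stage's state with the previous token's embedding.
            mixed = [s + e for s, e in zip(matvec(w_state, state), emb[token])]
            state = [math.tanh(x) for x in mixed]  # assumed activation
            logits = matvec(w_out, state)
            # Greedy pick for determinism; a sampled token could be used instead.
            token = max(range(len(logits)), key=logits.__getitem__)
            drafted.append(token)
        return drafted


speculator = ToySpeculator()
drafted_tokens = speculator.draft([0.1, -0.2, 0.3, 0.0], t0=5)
```

The point of the chained design in this sketch is that each head sees both the running state and the token chosen by the previous head, so later predictions are conditioned on earlier ones rather than made independently from Z alone.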