HiSpec: Hierarchical Speculative Decoding for LLMs

Avinash Kumar, Sujay Sanghavi, Poulami Das

TL;DR

The paper tackles the verification wall in speculative decoding for large language models by introducing HiSpec, a hierarchical framework that uses early-exit (EE) models to perform low-overhead intermediate verification and reuses key-value caches across the draft, intermediate verifier, and target. It strategically positions the intermediate verifier at about $\frac{1}{4}$ of model depth and uses a draft exit at about $\frac{1}{8}$ depth, along with a dynamic policy that triggers full-model verification only after a small number of tokens are tentatively accepted. Through experiments on diverse benchmarks and models, HiSpec achieves average throughput improvements of $1.28\times$ and up to $2.01\times$ over baseline single-layer speculation, while maintaining the same accuracy as the target model. The approach is validated across pre-trained and post-training EE variants, demonstrating strong generalization and practical impact for high-throughput LLM inference with large targets. The work highlights that accelerating verification, not just drafting, is crucial for scalable deployment of autoregressive, transformer-based LLMs.
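To make the placement rule concrete, the following is a minimal sketch that turns the "about 1/8 depth" draft exit and "about 1/4 depth" intermediate verifier into layer indices. The function name `hispec_exit_layers` and the rounding behaviour are assumptions made for illustration; the summary only gives the approximate fractions, so the paper's exact selection procedure may differ.

```python
from typing import Tuple


def hispec_exit_layers(num_layers: int) -> Tuple[int, int]:
    """Return (draft_exit_layer, intermediate_verifier_layer).

    Hypothetical helper: applies the ~1/8 and ~1/4 depth rule of thumb
    with simple rounding, which is an assumption of this sketch.
    """
    draft = max(1, round(num_layers / 8))
    # Keep the intermediate verifier strictly deeper than the draft exit.
    intermediate = max(draft + 1, round(num_layers / 4))
    return draft, intermediate


if __name__ == "__main__":
    # Layer counts taken from the figure captions: 32 and 80 layers.
    for name, depth in [("Llama3-8B", 32), ("Llama2-70B", 80)]:
        draft, mid = hispec_exit_layers(depth)
        print(f"{name}: draft exit at L{draft}, intermediate verifier at L{mid}")
```

For a 32-layer model this yields an intermediate verifier at layer 8, which matches the $L8 \rightarrow L32$ label used in the Figure 4 caption.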

Abstract

Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g., verification is $4\times$ slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. "Intermediate" verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose Hierarchical Speculative Decoding (HiSpec), a framework for high-throughput speculative decoding that exploits early-exit (EE) models for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by $1.28\times$ on average and by up to $2.01\times$ compared to the baseline single-layer speculation without compromising accuracy.
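The control flow described in the abstract (draft, intermediate verification, periodic target verification) can be illustrated with a toy greedy-decoding sketch. Everything here is an assumption made for illustration: `target`, `make_early_exit`, `verify`, and `hispec_generate` are hypothetical names, the "models" are simple deterministic functions rather than early-exit heads of a transformer, verification is done token by token instead of in one batched forward pass, and KV-cache and hidden-state reuse are not modeled. It is not the authors' implementation.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token prefix to a greedy next token
VOCAB = 50


def target(prefix: List[int]) -> int:
    # Toy stand-in for the full model: deterministic function of the prefix.
    return (sum(prefix) * 31 + len(prefix) * 7) % VOCAB


def make_early_exit(error_every: int) -> Model:
    # Toy early-exit head: agrees with the target except on prefixes whose
    # length is a multiple of `error_every`.
    def model(prefix: List[int]) -> int:
        tok = target(prefix)
        return (tok + 1) % VOCAB if len(prefix) % error_every == 0 else tok
    return model


def verify(prefix: List[int], proposed: List[int], verifier: Model) -> List[int]:
    # Greedy verification: keep proposed tokens up to the first disagreement,
    # then substitute the verifier's own token and discard the rest.
    accepted: List[int] = []
    for tok in proposed:
        expected = verifier(prefix + accepted)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
    return accepted


def hispec_generate(prompt: List[int], n_new: int, draft: Model, mid: Model,
                    full: Model, k: int = 4, threshold: int = 6) -> List[int]:
    out = list(prompt)         # tokens confirmed by the full model
    tentative: List[int] = []  # tokens accepted only by the intermediate verifier
    while len(out) - len(prompt) < n_new:
        ctx = out + tentative
        proposal: List[int] = []
        for _ in range(k):     # draft k tokens autoregressively
            proposal.append(draft(ctx + proposal))
        tentative += verify(ctx, proposal, mid)
        if len(tentative) >= threshold:
            # Full-model verification only after enough tentative tokens.
            out += verify(out, tentative, full)
            tentative = []
    return out[:len(prompt) + n_new]


def autoregressive(prompt: List[int], n_new: int, full: Model) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(full(out))
    return out


if __name__ == "__main__":
    draft, mid = make_early_exit(3), make_early_exit(5)
    fast = hispec_generate([1, 2, 3], 40, draft, mid, target)
    print(fast == autoregressive([1, 2, 3], 40, target))
```

Because every committed token is either confirmed or replaced by the full model, the toy output is identical to plain greedy decoding with `target`, mirroring the claim that periodic target verification preserves the target model's accuracy. The `k` and `threshold` values are arbitrary choices for this sketch.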

Paper Structure

This paper contains 15 sections, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Throughput of various representative benchmarks for the Llama3-8B model relative to standard auto-regressive decoding (higher throughput is better). HiSpec consistently outperforms state-of-the-art prior works (AdaDecode, Lookahead Decoding, and LayerSkip) that mainly focus on accelerating draft token generation.
  • Figure 2: Latency of the draft token generation and token verification phases for different draft and target model combinations for the ShareGPT dataset. Verification takes $2$-$10.3\times$ longer than token generation and the gap between the two latencies grows with the size of the target models.
  • Figure 3: (a) Standard speculative decoding. (b) Our proposal, HiSpec, uses early-exit models for intermediate verification to reject inaccurate tokens early, thereby also accelerating subsequent draft generation. HiSpec reuses KV caches and hidden states to improve compute and memory efficiency and performs periodic target verification to maintain accuracy.
  • Figure 4: Percentage of tokens produced by the first one-fourth of the model's layers that are accepted by the final layer. We use Llama models for (a) text summarization (CNN/DM, Xsum) and (b) other tasks, such as dialogue (ShareGPT), mathematical reasoning (GSM8K), and code generation (HumanEval). Each label denotes the layer at one-fourth of the model's depth and the final layer (such as $L8 \rightarrow L32$ for Llama-8B with 32 layers). We observe that about one-fourth of the model is sufficient to generate up to 69% of the output tokens correctly. We use this information to position the intermediate verifier in HiSpec.
  • Figure 5: Throughput of HiSpec for different draft and intermediate layer combinations relative to vanilla auto-regressive decoding for the (a) ShareGPT dataset using Llama3-8B (32 layers) and the (b) CNN/DM dataset using Llama2-70B (80 layers). HiSpec's selection of draft and intermediate verifier (circled) yields the highest throughput.
  • ...and 4 more figures