FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun

TL;DR

FR-Spec tackles slow decoding in large-vocabulary LLMs by exposing LM Head bottlenecks in speculative sampling and introducing a frequency-ranked draft space based on corpus-level token frequencies. It reduces LM Head and softmax computation by restricting drafting to a high-frequency subset $\mathcal{V}_{high}$, achieving a complexity reduction from $\mathcal{O}(n d |\mathcal{V}|)$ to $\mathcal{O}(n d |\mathcal{V}_{high}|)$ while preserving the final output distribution through verification on the full vocabulary. Empirical results show about $1.12\times$ speedup over EAGLE-2 and $1.08\times$ over Medusa on multiple LLMs, with best settings around $|\mathcal{V}_{high}| = 32k$; the approach is plug-and-play and requires no retraining. The work highlights the importance of implementation details and vocabulary structure in speeding up large-vocabulary decoding, offering a practical path for deployment on resource-limited devices.
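As a rough worked instance of the complexity reduction, assume $|\mathcal{V}| \approx 128\text{k}$ (the Llama-3-8B vocabulary size cited in the abstract) and $|\mathcal{V}_{high}| = 32\text{k}$. The number of draft positions $n$ and the hidden size $d$ cancel, so the draft-side LM Head cost ratio depends only on the vocabulary sizes:

$$
\frac{n\, d\, |\mathcal{V}_{high}|}{n\, d\, |\mathcal{V}|} = \frac{32\text{k}}{128\text{k}} = 0.25
$$

Under these assumed sizes the draft LM Head performs about a quarter of the original work, which is consistent with the 75% reduction reported in the abstract.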

Abstract

Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code available at https://github.com/thunlp/FR-Spec.
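The idea of constraining the draft search to a frequency-prioritized subset can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); the function names `build_high_freq_index`, `draft_logits`, and `draft_top1` are hypothetical. It also assumes the LM Head is a plain weight matrix of shape $(|\mathcal{V}|, d)$ and that corpus token counts are already available.

```python
import numpy as np

def build_high_freq_index(token_counts, k):
    """Return the ids of the k most frequent tokens, sorted by token id.

    token_counts: (|V|,) array of corpus-level occurrence counts (assumed given).
    """
    top = np.argsort(-token_counts, kind="stable")[:k]
    return np.sort(top)

def draft_logits(hidden, lm_head, high_freq_ids):
    """Compute draft logits over the high-frequency subset only.

    hidden:  (n, d) draft hidden states
    lm_head: (|V|, d) full LM Head weight matrix
    returns: (n, |V_high|) logits, costing O(n * d * |V_high|)
    """
    cropped = lm_head[high_freq_ids]  # (|V_high|, d); in practice cropped once, offline
    return hidden @ cropped.T

def draft_top1(hidden, lm_head, high_freq_ids):
    """Pick the best draft token per position and map it back to a full-vocabulary id."""
    local = draft_logits(hidden, lm_head, high_freq_ids).argmax(axis=-1)
    return high_freq_ids[local]
```

Because the drafted tokens are mapped back to full-vocabulary ids, the target model can verify them against its unmodified LM Head, which is why cropping affects only which tokens get proposed and not the final output distribution.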

Paper Structure

This paper contains 18 sections, 4 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison of the drafting and verification times of EAGLE-2 implemented by three frameworks (Huggingface, SGLang, and our optimized implementation) for two vocabulary sizes: 32k (Llama-2-7B) and 128k (Llama-3-8B).
  • Figure 2: Token frequency distribution, computed with the Llama-3-8B tokenizer on a subset of 1B tokens randomly sampled from the SlimPajama-627B dataset [cerebras2023slimpajama]. As shown in the figure, 75% of the vocabulary tokens account for less than 5% of all token occurrences in the dataset, a "long tail" effect.
  • Figure 3: (Left) The drafting process of EAGLE-2 with prompt $P =$ "It", beam width $= 2$, and search depth $= 3$. It picks out the top $K = 8$ probability tokens (purple) as the draft tree. (Right) The drafting process of FR-Spec, where the LM Head is cropped during the drafting process while the beam search procedure remains the same.
  • Figure 4: Illustration of the verification process for EAGLE-2 and FR-Spec, given the draft in Figure 3. FR-Spec modifies only the drafting process; the verification process remains consistent with EAGLE-2.
  • Figure 5: Comparison of Python-based implementation and C-based implementation. X, Y, and Z represent three different short-duration computational tasks.
  • ...and 3 more figures