FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun
TL;DR
FR-Spec tackles slow decoding in large-vocabulary LLMs. It identifies the LM Head as a bottleneck in speculative sampling and introduces a frequency-ranked draft space built from corpus-level token frequencies. Restricting drafting to a high-frequency subset $\mathcal{V}_{high}$ reduces the LM Head and softmax cost of drafting from $\mathcal{O}(n d |\mathcal{V}|)$ to $\mathcal{O}(n d |\mathcal{V}_{high}|)$, while verification on the full vocabulary preserves the final output distribution. Empirical results show about $1.12\times$ speedup over EAGLE-2 and $1.08\times$ over Medusa on multiple LLMs, with the best settings around $|\mathcal{V}_{high}| = 32k$; the approach is plug-and-play and requires no retraining. The work highlights how much implementation details and vocabulary structure matter for large-vocabulary decoding, and offers a practical path toward deployment on resource-limited devices.
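As a rough worked instance of the complexity claim, assume a vocabulary of 128k tokens (the Llama-3-8B figure) and $|\mathcal{V}_{high}| = 32k$. The ratio of draft-side LM Head cost is then:

$$\frac{n d \, |\mathcal{V}_{high}|}{n d \, |\mathcal{V}|} = \frac{32k}{128k} = \frac{1}{4}$$

Under these assumed sizes, the draft LM Head does about a quarter of the original matrix-multiply work, a 75% reduction. This counts only the LM Head term and says nothing about the end-to-end speedup, which also depends on the rest of the draft and verification pipeline.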
Abstract
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code available at https://github.com/thunlp/FR-Spec.
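The vocabulary-compression idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`build_high_freq_ids`, `draft_logits_subset`, `draft_tokens`) and the greedy argmax drafting are hypothetical simplifications chosen for illustration, and the verification step on the full vocabulary is omitted.

```python
import numpy as np

def build_high_freq_ids(token_counts, k):
    """Return the ids of the k most frequent tokens (hypothetical helper).

    token_counts: 1-D signed integer array of corpus-level counts, indexed by token id.
    """
    # Stable sort on negated counts: descending frequency, ties broken by lower id.
    order = np.argsort(-token_counts, kind="stable")
    return order[:k]

def draft_logits_subset(hidden, lm_head, high_ids):
    """Compute draft logits only over the high-frequency subset.

    hidden:   (n, d) draft hidden states
    lm_head:  (|V|, d) full LM Head weight matrix
    high_ids: (k,) token ids in the subset
    Returns an (n, k) array, costing O(n d k) instead of O(n d |V|).
    """
    sub_head = lm_head[high_ids]  # (k, d) slice; can be precomputed once
    return hidden @ sub_head.T

def draft_tokens(hidden, lm_head, high_ids):
    """Greedy draft: pick the best subset token, mapped back to full-vocabulary ids."""
    logits = draft_logits_subset(hidden, lm_head, high_ids)
    local_best = logits.argmax(axis=-1)  # indices into the subset
    return high_ids[local_best]          # indices into the full vocabulary
```

Because the draft only proposes candidates, restricting it to the subset affects which tokens get proposed, not which are accepted; in the method described above, the target model still verifies every proposal against the full vocabulary.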
