Speculative Decoding with a Speculative Vocabulary

Miles Williams, Young D. Kwon, Rui Li, Alexandros Kouris, Stylianos I. Venieris

TL;DR

This work addresses the vocabulary bottleneck in speculative decoding for autoregressive LMs by introducing SpecVocab, a dynamic vocabulary speculation method. SpecVocab computes a context-aware subset of the target vocabulary at each decoding step using a two-stage ranking: a down-projected hidden state yields approximate logits from which a top-k candidate set is formed, after which exact logits are computed over that subset only. Trained via distillation with an auxiliary loss, SpecVocab can be integrated with strong speculative decoders such as EAGLE-3 and is supported by a custom fused kernel that accelerates the per-step computations. Empirically, SpecVocab achieves a higher acceptance length than static-vocabulary approaches, including EAGLE-3, across multiple models and tasks, with an increase in average throughput of up to 8.1% over EAGLE-3 and favorable scaling with model size. These results highlight the practical potential of dynamic vocabulary speculation to accelerate LM inference without changing the outputs.
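
The two-stage ranking described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `speculate_vocab_logits`, the helper `matvec`, the weight names `down_proj`, `approx_head` and `lm_head`, and the assumption that approximate logits come from a second, low-dimensional output matrix are all hypothetical choices made for illustration. The weights are random, so only the data flow is meaningful.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def speculate_vocab_logits(hidden, down_proj, approx_head, lm_head, k):
    """Return (candidate token ids, exact logits for those ids).

    Assumed shapes (illustrative, not taken from the paper):
      hidden:      length d
      down_proj:   d' rows of length d
      approx_head: V rows of length d'
      lm_head:     V rows of length d
    """
    # Stage 1: down-project the hidden state (d -> d') and cheaply score all V tokens.
    low_dim = matvec(down_proj, hidden)
    approx_logits = matvec(approx_head, low_dim)

    # Keep the k token ids with the highest approximate logits.
    ranked = sorted(range(len(approx_logits)),
                    key=approx_logits.__getitem__, reverse=True)
    candidates = ranked[:k]

    # Stage 2: exact logits, computed only for the selected rows of the LM head.
    exact_logits = [sum(w * x for w, x in zip(lm_head[t], hidden))
                    for t in candidates]
    return candidates, exact_logits


# Toy sizes: vocabulary V, hidden size d, intermediate size d', candidates k.
V, d, d_low, k = 50, 8, 2, 5
rng = random.Random(0)
hidden = [rng.gauss(0, 1) for _ in range(d)]
down_proj = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d_low)]
approx_head = [[rng.gauss(0, 1) for _ in range(d_low)] for _ in range(V)]
lm_head = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(V)]

candidates, exact_logits = speculate_vocab_logits(
    hidden, down_proj, approx_head, lm_head, k)
print(candidates, exact_logits)
```

Under these assumed shapes, a full LM head costs roughly V·d multiply-adds per step, whereas the sketch costs roughly d·d' + V·d' + k·d, which is smaller whenever d' is well below d and k is well below V.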

Abstract

Speculative decoding has rapidly emerged as a leading approach for accelerating language model (LM) inference, as it offers substantial speedups while yielding identical outputs. It relies upon a small draft model, tasked with predicting the outputs of the target model. State-of-the-art speculative decoding methods use a draft model consisting of a single decoder layer and an output embedding matrix, with the latter dominating drafting time for the latest LMs. Recent work has sought to address this output distribution bottleneck by reducing the vocabulary of the draft model. Although this can improve throughput, it compromises speculation effectiveness when the target token is out-of-vocabulary. In this paper, we argue for vocabulary speculation as an alternative to a reduced vocabulary. We propose SpecVocab, an efficient and effective method that selects a vocabulary subset per decoding step. Across a variety of tasks, we demonstrate that SpecVocab can achieve a higher acceptance length than the state-of-the-art speculative decoding approach, EAGLE-3. Notably, this yields up to an 8.1% increase in average throughput over EAGLE-3.

Paper Structure (43 sections, 6 equations, 5 figures, 13 tables)

This paper contains 43 sections, 6 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: Vocabulary speculation accelerates speculative decoding by computing the output distribution for only a contextually relevant subset of the vocabulary.
  • Figure 2: Outline of the draft model architectures for speculative decoding. EAGLE-2 forms predictions over the entire target model vocabulary, whereas EAGLE-3 uses a fixed subset, as in FR-Spec and VocabTrim. In contrast, SpecVocab (ours) speculates on which subset of the target model vocabulary to use at each decoding step.
  • Figure 3: The sequence of memory access operations required by the indexed LM head operation, alternating between global and cache memory.
  • Figure 4: The acceptance length and throughput when varying both the number of candidate tokens ($k$) and intermediate dimensionality of our method relative to the target model ($d'/d$). We present the results for every model across five seeds, with the standard deviation denoted by the shaded area.
  • Figure 5: Microbenchmark results for our custom fused kernel versus a PyTorch baseline, for each model.