ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

TL;DR

ParallelSpec is an alternative to the auto-regressive drafting strategies used in state-of-the-art speculative decoding approaches. It learns to predict multiple future tokens in parallel using a single model, and it can be integrated, with minimal training cost, into any speculative decoding framework that requires aligning the output distributions of the drafter and the target model.

Abstract

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most existing works still draft tokens auto-regressively to maintain sequential dependency in language modeling, which we consider a huge computational burden in speculative decoding. We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches. In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model. ParallelSpec learns to efficiently predict multiple future tokens in parallel using a single model, and it can be integrated into any speculative decoding framework that requires aligning the output distributions of the drafter and the target model with minimal training cost. Experimental results show that ParallelSpec accelerates baseline methods in latency up to 62% on text generation benchmarks from different domains, and it achieves 2.84X overall speedup on the Llama-2-13B model using third-party evaluation criteria.
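
The draft-then-verify loop described in the abstract can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: `target_next`, `parallel_draft`, and `speculative_generate` are hypothetical names, the "models" are deterministic stand-in functions, and verification uses simple greedy matching rather than the speculative sampling or tree-based verification that frameworks such as EAGLE or Medusa may use. The point it shows is that the drafter returns K guesses from one call, and the target accepts the longest matching prefix plus one token of its own.

```python
from typing import List, Tuple

Token = int
VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical stand-in for the target model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def parallel_draft(prefix: List[Token], k: int) -> List[Token]:
    """Hypothetical stand-in for a parallel drafter: k guesses per call.

    A real parallel drafter would append k [MASK] tokens and read all k
    predictions from a single forward pass. This toy version loops
    internally and corrupts some guesses to imitate drafting errors.
    """
    guesses: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 3 == 0:
            tok = (tok + 1) % VOCAB  # deliberate drafting error
        guesses.append(tok)
        ctx.append(tok)
    return guesses


def speculative_generate(
    prompt: List[Token], max_new_tokens: int, k: int
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify loop; returns (tokens, number of rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = parallel_draft(out, k)
        accepted: List[Token] = []
        # Verification: compare each draft token with the target's choice.
        # A real system scores all k positions in one target forward pass.
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # correct the first mismatch, stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the target adds one bonus token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
        rounds += 1
    return out[: len(prompt) + max_new_tokens], rounds


def greedy_generate(prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Reference: plain token-by-token decoding with the target alone."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


if __name__ == "__main__":
    tokens, rounds = speculative_generate([1, 2, 3], max_new_tokens=20, k=4)
    print("matches target-only decoding:", tokens == greedy_generate([1, 2, 3], 20))
    print("verification rounds:", rounds)
```

Because every emitted token is either confirmed or supplied by the target, the output is identical to target-only greedy decoding; any gain comes from needing fewer verification rounds than generated tokens, and, in the parallel-drafting setting, from producing each batch of K guesses in a single drafter call.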

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of ParallelSpec. Left: comparison between auto-regressive drafting and our proposed parallel drafting. Blocks in green indicate normal draft tokens. Blocks in yellow denote the mask tokens used to prompt the draft model to generate multiple future tokens in a single forward pass. Right: wall time trace diagrams for two drafting styles integrated with EAGLE in two rounds of speculative sampling.
  • Figure 2: Illustration of parallel drafter inference, training, and the difference between training an auto-regressive drafter and a parallel one. Left: The parallel drafter proposes multiple candidate tokens with a single forward pass. Middle: Training the parallel drafter to align with the target model is a process of knowledge distillation (KD). Right: The input, labels, and position indices for training a parallel drafter need special treatment. $\dag$ refers to Figure 3 for the special attention mask design used in training.
  • Figure 3: Attention mask illustration of parallel drafter training. The legend distinguishes activated attention from attention that is suppressed to prevent access across parallel groups. -100 denotes ignored tokens in the target sequence that do not contribute to the training loss. The yellow blocks, together with the legend, illustrate one of the next-next token prediction training objectives.
  • Figure 4: Ablations on speedup ratio and average acceptance length $\tau$ with respect to the number of [MASK] tokens $K$ on all three test datasets.
  • Figure 5: Upper: Visualization of accelerated tokens in generation from (a) Vicuna-7B Medusa and (b) Vicuna-7B Medusa-ParallelSpec given an input prompt from GSM8K. Lower: Simulated wall-time trace of the two methods generating the text in the highlighted box. We only consider the forward pass latency of drafting and verification, ignoring the negligible post-processing overhead. The legend distinguishes accepted draft tokens, rejected draft tokens, and tokens without speculative acceleration.
  • ...and 1 more figure