
PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation

Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, Emad Barsoum

TL;DR

PARD targets the latency bottleneck of autoregressive LLM inference by proposing a target-independent speculative decoding framework with parallel draft predictions. It introduces mask-token-based parallel drafting and a Conditional Drop-token (COD) training scheme to reduce adaptation costs while preserving accuracy. Empirical results on vLLM show up to 3.67x speedups on LLaMA3.1-8B and strong improvements across LLaMA3 and Qwen families, with training efficiency gains of up to 7x over prior methods. The approach demonstrates practical deployment benefits through cross-model acceleration, maintaining high throughput and generalization with a streamlined training process.

Abstract

The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a promising solution, adopting a draft-then-verify strategy to accelerate token generation. While the EAGLE series achieves strong acceleration, its requirement of training a separate draft head for each target model introduces substantial adaptation costs. In this work, we propose PARD (PARallel Draft), a novel speculative decoding method featuring target-independence and parallel token prediction. Specifically, PARD enables a single draft model to be applied across an entire family of target models without requiring separate training for each variant, thereby minimizing adaptation costs. Meanwhile, PARD substantially accelerates inference by predicting multiple future tokens within a single forward pass of the draft phase. To further reduce the training adaptation cost of PARD, we propose a COnditional Drop-token (COD) mechanism based on the integrity of prefix key-value states, enabling autoregressive draft models to be adapted into parallel draft models at low cost. Our experiments show that the proposed COD method improves draft model training efficiency by 3× compared with traditional masked prediction training. On the vLLM inference framework, PARD achieves up to 3.67× speedup on LLaMA3.1-8B, reaching 264.88 tokens per second, which is 1.15× faster than EAGLE-3. Our code is available at https://github.com/AMD-AIG-AIMA/PARD.
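
The draft-then-verify strategy described in the abstract can be illustrated with a minimal sketch of greedy speculative decoding in which the draft proposes all K candidates in one call. This is not the PARD implementation: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names, and the two "models" are toy integer functions standing in for real networks. The sketch only shows the control flow: accept the longest prefix of candidates that matches the target's greedy choice, then take one token from the target.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]                 # greedy next token for a prefix
ParallelDraft = Callable[[List[Token], int], List[Token]]  # K guesses from a single call


def toy_target(prefix: List[Token]) -> Token:
    """Toy stand-in for the target model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[Token], k: int) -> List[Token]:
    """Toy stand-in for a parallel draft: K guesses at once, wrong after a 9."""
    return [prefix[-1] + i for i in range(1, k + 1)]


def speculative_step(prefix: List[Token], draft: ParallelDraft,
                     target: NextToken, k: int) -> List[Token]:
    """One draft-then-verify step with greedy acceptance."""
    candidates = draft(prefix, k)
    accepted: List[Token] = []
    for cand in candidates:
        # A real target model would score all K positions in one forward pass;
        # this loop only emulates that check position by position.
        expected = target(prefix + accepted)
        if cand != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(cand)
    accepted.append(target(prefix + accepted))  # extra token when all K candidates match
    return accepted


def generate(prefix: List[Token], draft: ParallelDraft, target: NextToken,
             k: int, max_new_tokens: int) -> Tuple[List[Token], int]:
    """Run speculative steps until max_new_tokens are produced; also count steps."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
        steps += 1
    return out[: len(prefix) + max_new_tokens], steps


tokens, steps = generate([0], toy_draft, toy_target, k=3, max_new_tokens=12)
print(tokens, steps)
```

With greedy acceptance the output is identical to what the target alone would produce token by token; the benefit is that each step can emit several tokens for one target verification. In this toy run, 12 new tokens are produced in 4 steps.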

Paper Structure

This paper contains 22 sections, 20 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: PARD achieves low latency while maintaining high accuracy. (a) Comparison of first-token acceptance rates using LLaMA3.1-8B as the target model. EAGLE and EAGLE-3 use their official models, vanilla speculative decoding (VSD) employs LLaMA3.2-1B as the draft model, and PARD represents the adapted version of VSD. (b) Comparison of actual inference time between VSD and PARD. VSD generates candidate tokens autoregressively during the draft stage, requiring multiple forward passes. In contrast, PARD completes drafting with a single forward pass. The draft model used is LLaMA3.2-1B and the target model is LLaMA3.1-8B. (c) Illustrative comparison of training and inference efficiency between PARD and other methods.
  • Figure 2: Performance comparison of different methods on the HumanEval task under vLLM. AR denotes the auto-regressive baseline, and VSD denotes vanilla speculative decoding, where the draft models used are LLaMA3.2-1B and Qwen2.5-0.5B.
  • Figure 3: Illustration of PARD inference. Left: In vanilla speculative decoding, a draft model autoregressively generates $K$ candidate tokens, which are then validated by the target model in parallel. Right: PARD introduces mask tokens for parallel drafting, so all $K$ candidate tokens are generated in one forward pass.
  • Figure 4: Illustration of Conditional Drop in PARD training. (a) Training data of the standard AR model. (b) Training data of PARD. The diagram is divided into three sections by dashed lines, corresponding to training objectives for predicting tokens at positions $+1$, $+2$, and $+3$. The designed attention mask ensures consistency between training and inference. Labels in lighter font indicate tokens that are supplemented for context completion and do not contribute to the loss computation. (c) Sparse training data for PARD with Conditional Drop, where shaded areas represent dropped tokens. The retention pattern follows a geometric decay with a fraction $r=0.5$ of positions retained for mask token $m_0$ and $r^2=0.25$ for $m_1$, ensuring that each retained token maintains complete preceding key–value pairs. (d) The sparse matrix reorganized into a compact format by eliminating dropped positions.
  • Figure 5: (a) Effects of different values of $r$ and $r_{\text{min}}$, where each experiment is labeled PARD_$r$_$r_{\text{min}}$. The x-axis represents training time and the y-axis the final decoding speed. (b) Results under different $K_{\text{train}}$ and $K_{\text{infer}}$ settings. The x-axis represents $K_{\text{infer}}$, and the experiment names PARD_$K_{\text{train}}$ denote different $K_{\text{train}}$ values.
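
The geometric retention pattern in the Figure 4 caption can be made concrete with a small arithmetic sketch. It rests on assumptions the captions do not spell out: the ordinary next-token section (offset $+1$) is kept in full, the section for mask token $m_{j-1}$ keeps a fraction $r^j$ of positions, $r_{\text{min}}$ acts as a floor on that fraction, and each level is sampled as a subset of the previous one so that a retained mask token still has its preceding key-value states. The function names are hypothetical and this is one reading of the caption, not the authors' code.

```python
import random
from typing import List


def retention_fractions(k_train: int, r: float, r_min: float = 0.0) -> List[float]:
    """Fraction of positions kept for prediction offsets +1 .. +k_train."""
    # Offset +1 is the standard AR objective and is kept in full (assumption).
    # Offset +(j+1) keeps r**j of the positions, floored at r_min (assumption).
    return [1.0] + [max(r ** j, r_min) for j in range(1, k_train)]


def relative_training_tokens(k_train: int, r: float, r_min: float = 0.0) -> float:
    """Training tokens with dropping, relative to keeping every position."""
    return sum(retention_fractions(k_train, r, r_min)) / k_train


def sample_nested_retention(num_positions: int, k_train: int, r: float,
                            r_min: float = 0.0, seed: int = 0) -> List[List[int]]:
    """Sample retained positions per offset, each level a subset of the previous."""
    rng = random.Random(seed)
    kept = list(range(num_positions))
    levels = [kept]
    for j in range(1, k_train):
        target = max(r ** j, r_min)
        n_keep = min(len(kept), round(target * num_positions))
        # Drawing from the previous level keeps the prefix states of every
        # retained mask token available.
        kept = sorted(rng.sample(kept, n_keep))
        levels.append(kept)
    return levels


print(retention_fractions(3, 0.5))        # [1.0, 0.5, 0.25]
print(relative_training_tokens(3, 0.5))   # 1.75 / 3, roughly 0.58
print([len(level) for level in sample_nested_retention(8, 3, 0.5)])  # [8, 4, 2]
```

Under these assumptions, the total kept fraction is bounded by the geometric series $1 + r + r^2 + \dots < 1/(1-r)$, so with $r=0.5$ the retained training tokens stay below twice the length of the original sequence regardless of how many offsets are trained.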