PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation

Zihao An; Huajun Bai; Ziqiong Liu; Dong Li; Emad Barsoum

TL;DR

PARD addresses the latency bottleneck of autoregressive LLM inference with a target-independent speculative decoding framework whose draft model predicts multiple future tokens in parallel. It introduces mask-token-based parallel drafting and a COnditional Drop-token (COD) training scheme that lowers adaptation cost while preserving accuracy. On vLLM, PARD achieves up to a 3.67x speedup on LLaMA3.1-8B and accelerates models across the LLaMA3 and Qwen families, and COD improves draft-model training efficiency by 3x over traditional masked prediction training. Because a single draft model serves an entire family of target models, deployment does not require training a separate draft model for each variant.

Abstract

The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a promising solution, adopting a draft-then-verify strategy to accelerate token generation. While the EAGLE series achieves strong acceleration, its requirement of training a separate draft head for each target model introduces substantial adaptation costs. In this work, we propose PARD (PARallel Draft), a novel speculative decoding method featuring target-independence and parallel token prediction. Specifically, PARD enables a single draft model to be applied across an entire family of target models without requiring separate training for each variant, thereby minimizing adaptation costs. Meanwhile, PARD substantially accelerates inference by predicting multiple future tokens within a single forward pass of the draft phase. To further reduce the training adaptation cost of PARD, we propose a COnditional Drop-token (COD) mechanism based on the integrity of prefix key-value states, enabling autoregressive draft models to be adapted into parallel draft models at low cost. Our experiments show that the proposed COD method improves draft model training efficiency by 3× compared with traditional masked prediction training. On the vLLM inference framework, PARD achieves up to 3.67× speedup on LLaMA3.1-8B, reaching 264.88 tokens per second, which is 1.15× faster than EAGLE-3. Our code is available at https://github.com/AMD-AIG-AIMA/PARD.
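
The draft-then-verify strategy described in the abstract can be illustrated with a minimal greedy sketch. This is not PARD's implementation: `toy_target_next`, `toy_draft_k`, `verify_greedy`, and `speculative_decode` are hypothetical names, and the "models" are toy integer functions. The draft function returns `k` tokens from one call, standing in for a parallel draft model that predicts several future tokens in a single forward pass. A real system would also verify all draft tokens in one target forward pass; here the target is queried per position purely for clarity.

```python
from typing import Callable, List, Tuple

Token = int


def toy_target_next(ctx: List[Token]) -> Token:
    """Hypothetical target model: greedy next token is (last + 1) mod 10."""
    return (ctx[-1] + 1) % 10


def toy_draft_k(ctx: List[Token], k: int) -> List[Token]:
    """Hypothetical parallel draft: k guesses from a single call.
    It deliberately guesses 0 wherever the correct token is 7."""
    guesses = [(ctx[-1] + i) % 10 for i in range(1, k + 1)]
    return [0 if g == 7 else g for g in guesses]


def verify_greedy(target_next: Callable[[List[Token]], Token],
                  prefix: List[Token],
                  draft: List[Token]) -> List[Token]:
    """Accept the longest draft prefix matching the target's greedy choices,
    then add one target token (a correction, or a bonus if all matched)."""
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # every draft token was accepted
    return accepted


def speculative_decode(target_next, draft_k, prompt: List[Token],
                       k: int, max_new: int) -> Tuple[List[Token], int]:
    """Run draft-then-verify rounds until max_new tokens are generated.
    Returns the sequence and the number of verification rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        draft = draft_k(out, k)
        out.extend(verify_greedy(target_next, out, draft))
        rounds += 1
    return out[:len(prompt) + max_new], rounds


if __name__ == "__main__":
    tokens, rounds = speculative_decode(
        toy_target_next, toy_draft_k, prompt=[0], k=4, max_new=12)
    print(tokens, rounds)
```

In this toy run, 12 new tokens are produced in 3 verification rounds instead of 12 sequential target steps, and the output is identical to what plain greedy decoding with the target alone would give. The achievable speedup in practice depends on how often draft tokens are accepted and on the relative cost of the draft model.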

Paper Structure

Table of Contents

Figures (5)