MineDraft: A Framework for Batch Parallel Speculative Decoding

Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low

Abstract

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that a larger target model then verifies. However, standard SD is often limited by the strictly sequential execution of its drafting and verification stages. To address this, we propose MineDraft, a batch parallel speculative decoding (PSD) framework that hides drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. In our experiments, MineDraft improves throughput by up to 75% and end-to-end latency by up to 39% over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.
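The overlap idea in the abstract can be illustrated with a toy timing model. This is only a sketch under strong assumptions that are not taken from the paper: each drafting pass takes a fixed `draft_time`, each verification pass takes a fixed `verify_time`, communication is free, and token acceptance is ignored. The function names are hypothetical and this is not the paper's analysis of $T_{\textnormal{SD}}$ or $T_{\textnormal{PSD}}$.

```python
def sequential_time(num_steps: int, draft_time: float, verify_time: float) -> float:
    """Toy cost of standard SD: every verification pass waits for its draft."""
    return num_steps * (draft_time + verify_time)


def overlapped_time(num_steps: int, draft_time: float, verify_time: float) -> float:
    """Toy cost when drafting for one batch overlaps verification of the other.

    Assumption: only the very first draft is exposed; afterwards each step
    costs as much as the slower of the two stages.
    """
    return draft_time + num_steps * max(draft_time, verify_time)


if __name__ == "__main__":
    steps, d, v = 10, 1.0, 3.0  # illustrative values, not measurements
    print("sequential:", sequential_time(steps, d, v))
    print("overlapped:", overlapped_time(steps, d, v))
```

In this toy model the drafting cost disappears from the steady state whenever `draft_time <= verify_time`, which is the sense in which overlapping hides drafting latency.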

Paper Structure (25 sections, 4 theorems, 23 equations, 25 figures)

Key Result

Theorem 1

Let $W$ be the Lambert $W$ function and $f(t) = 1 - e^{-\alpha t}$ for some constant $\alpha \in \mathbb{R}^+$. For $\alpha V \geq -W_{-1}(-\frac{1}{2e})-1 \approx 1.68$, we have $T_{\textnormal{SD}} > 1.59\ T_{\textnormal{PSD}}$.
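The threshold constant in Theorem 1 can be checked numerically. The sketch below evaluates the lower branch $W_{-1}$ by bisection on $w e^{w} = x$ over $w \le -1$, using only the standard library. The helper name `lambert_w_minus1` and the bracket $[-50, -1]$ are assumptions made for illustration. It checks only the constant, not the inequality $T_{\textnormal{SD}} > 1.59\ T_{\textnormal{PSD}}$ itself.

```python
import math


def lambert_w_minus1(x: float, iterations: int = 200) -> float:
    """Approximate the W_{-1} branch: the solution w <= -1 of w * exp(w) = x.

    Valid here for -1/e <= x <= -1e-18, so that the bracket [-50, -1]
    contains the root. On this interval w * exp(w) decreases as w increases.
    """
    if not (-1.0 / math.e <= x <= -1e-18):
        raise ValueError("x outside the range supported by this sketch")
    lo, hi = -50.0, -1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) > x:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies at or to the left of mid
    return 0.5 * (lo + hi)


# Threshold on alpha * V from Theorem 1: -W_{-1}(-1/(2e)) - 1
w = lambert_w_minus1(-1.0 / (2.0 * math.e))
threshold = -w - 1.0

if __name__ == "__main__":
    print(f"alpha * V threshold: {threshold:.4f}")  # the theorem quotes about 1.68
```

If SciPy is available, `scipy.special.lambertw(x, k=-1)` gives the same branch directly.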

Figures (25)

  • Figure 1: MineDraft parallelizes drafting and verification: a draft model generates tokens while the target model simultaneously verifies the previously generated draft tokens, thereby hiding drafting latency and improving overall inference throughput.
  • Figure 2: Architecture overview of MineDraft. (Left) The Scheduler manages request life-cycles and batch IDs by coordinating with the Batch Manager, which maintains two batches to enable parallelism in MineDraft. (Right) Parallel execution timeline of the Drafter and Verifier across speculative decoding (SD) steps. Magenta blocks/arrows denote broadcast of drafts from the Drafter to the Verifier, while dark green blocks/arrows denote point-to-point dispatch of target sampler outputs from the Verifier to the Drafter. Ticks indicate synchronization points between SD steps. During the initial SD step (before $s_1$), the Drafter sequentially drafts Batch 0, broadcasts drafts to the Verifier, then drafts Batch 1, while the Verifier immediately processes Batch 0 upon receipt and returns outputs to the Drafter. In subsequent SD steps, the two batches alternate roles between drafting and verification, enabling overlapped execution. Batch alternation is triggered when the Drafter returns outputs to the Scheduler at the end of each SD step.
  • Figure 3: Throughput comparison against baseline methods across model settings 1--3. $\uparrow$ indicates the average improvement over the best baseline method. $\Delta$ indicates the maximum average gap between MineDraft and standard SD. More details about $\uparrow$ and $\Delta$ are provided in the appendix on performance improvement. MineDraft consistently outperforms baselines, improving average throughput by up to 65.02% over the best-performing baseline and by up to 75.68% over standard SD.
  • Figure 4: Throughput comparison across model settings 5 and 6. $\uparrow$ indicates the average improvement over the best baseline method. $\Delta$ indicates the maximum average gap between MineDraft and standard SD or EAGLE. MineDraft consistently outperforms standalone EAGLE and standard SD, achieving maximum average throughput gains of 37.06% and 22.09%, respectively.
  • Figure 5: Throughput results on model setting 4. Standard SD experiments failed on this setting due to OOM.
  • ...and 20 more figures

Theorems & Definitions (7)

  • Theorem 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1
  • proof