ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding

Shuzhang Zhong, Zebin Yang, Meng Li, Ruihao Gong, Runsheng Wang, Ru Huang

TL;DR

The paper tackles the efficiency bottleneck of autoregressive LLM decoding with ProPD, a parallel decoding framework that combines early pruning with dynamic token-tree generation. Early pruning reduces the verification workload by discarding unlikely token branches, while dynamic generation adapts the token-tree size in real time to decoding conditions and batch configurations. A weighted-regression overhead estimator and a top-k probability framework guide the adaptive sizing, balancing acceptance length against verification cost. Experiments on Vicuna models across multiple datasets show consistent speedups of 1.1× to 3.2× over autoregressive decoding, BPD, and Medusa. The gains are largest at larger batch sizes, and combining both components outperforms using either one alone.
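The trade-off between acceptance length and verification cost can be illustrated with a small sketch. This is not the paper's implementation: it assumes that iteration time is roughly linear in token-tree size for a fixed batch size, fits that line with a recency-weighted least squares, and then picks the candidate size with the best estimated accepted tokens per unit time. The function names, the exponential decay weighting, and all sample numbers are hypothetical.

```python
def fit_weighted_line(sizes, times, decay=0.9):
    """Weighted least-squares fit of time ~ a * size + b.

    Later samples get larger weights; the exponential decay scheme is an
    assumption made for this sketch, not taken from the paper.
    """
    n = len(sizes)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    w_sum = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, sizes)) / w_sum
    mean_y = sum(w * y for w, y in zip(weights, times)) / w_sum
    cov = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, sizes, times))
    var = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, sizes))
    a = cov / var if var > 0 else 0.0
    b = mean_y - a * mean_x
    return a, b


def choose_tree_size(candidates, expected_accept_len, a, b):
    """Pick the tree size with the best estimated tokens per unit time."""
    def throughput(size):
        return expected_accept_len[size] / (a * size + b)
    return max(candidates, key=throughput)


# Hypothetical measurements: token-tree sizes and iteration times in ms.
sizes = [8, 16, 32, 64]
times = [21.0, 22.0, 24.0, 28.0]
a, b = fit_weighted_line(sizes, times)

# Hypothetical expected acceptance lengths with diminishing returns.
accept = {8: 1.8, 16: 2.2, 32: 2.5, 64: 2.6}
best = choose_tree_size(sizes, accept, a, b)
print(f"slope={a:.3f} ms/token, intercept={b:.1f} ms, best size={best}")
```

With these made-up numbers, the largest tree is not the best choice: its extra verification time outweighs the small gain in acceptance length, which is the effect the adaptive sizing is meant to capture.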

Abstract

Recent advancements in generative large language models (LLMs) have significantly boosted performance in natural language processing tasks. However, their efficiency is hampered by the inherent limitations of autoregressive token generation. While parallel decoding with token tree verification, e.g., Medusa, has been proposed to improve decoding parallelism and efficiency, it often struggles to maintain contextual relationships because it predicts tokens independently, and it incurs significant verification overhead, especially with large tree sizes and batch processing. In this paper, we propose ProPD, an efficient LLM parallel decoding framework based on dynamic token tree pruning and generation. ProPD features an advanced early pruning mechanism that efficiently eliminates unpromising token sequences to improve verification efficiency. Additionally, it introduces a dynamic token tree generation algorithm that balances the computation and parallelism of the verification phase in real time and maximizes overall efficiency across different batch sizes, sequence lengths, and tasks. We verify ProPD across a diverse set of datasets, LLMs, and batch sizes and demonstrate that ProPD consistently outperforms existing decoding algorithms by 1.1-3.2×.
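To make the early pruning idea concrete, here is a minimal sketch. It assumes, hypothetically, that an early layer supplies a top-k set of plausible successor tokens for each node, and that a branch is dropped as soon as one of its tokens falls outside the set predicted for its parent. The parent-index tree encoding, the `topk_by_node` input, and the example values are illustrative assumptions, not the paper's data structures.

```python
def prune_token_tree(tokens, parents, topk_by_node):
    """Return the indices of nodes that survive early pruning.

    tokens[i]       -- candidate token id at node i
    parents[i]      -- index of node i's parent (-1 for the root)
    topk_by_node[i] -- set of token ids that an early layer ranks in its
                       top-k as successors of node i (hypothetical input)
    Nodes must be listed parents-first.
    """
    kept = []
    alive = [False] * len(tokens)
    for i, (tok, par) in enumerate(zip(tokens, parents)):
        if par == -1:
            # The root is the already-accepted token and is always kept.
            alive[i] = True
        else:
            # A node survives only if its parent survived and its token
            # is among the parent's top-k successors.
            alive[i] = alive[par] and tok in topk_by_node.get(par, set())
        if alive[i]:
            kept.append(i)
    return kept


# Hypothetical six-node token tree.
tokens = [10, 11, 12, 13, 14, 15]
parents = [-1, 0, 0, 1, 1, 2]
topk_by_node = {0: {11, 99}, 1: {13}, 2: {15}}

kept = prune_token_tree(tokens, parents, topk_by_node)
print(kept)
```

In this example node 5 is removed even though its own token matches, because its parent (node 2) was already pruned. Only the surviving nodes would go on to the full verification pass.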

Paper Structure

This paper contains 25 sections, 4 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Workflow of (a) autoregressive decoding, (b) parallel decoding.
  • Figure 2: Example of parallel decoding: (a) model architecture, (b) token tree, (c) part of token tree mask.
  • Figure 3: Iteration time and acceptance length under different scenarios: (a) Top-$k$ accuracy under different early layers, (b) average iteration time under different batch sizes and token tree sizes, (c) average iteration time under different sequence lengths, (d) average acceptance length under different tasks.
  • Figure 4: Overview of ProPD.
  • Figure 5: Early Pruning Algorithm of ProPD.
  • ...and 2 more figures