Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference

Hao Mark Chen; Wayne Luk; Ka Fai Cedric Yiu; Rui Li; Konstantin Mishchenko; Stylianos I. Venieris; Hongxiang Fan


TL;DR

The paper tackles the latency and memory bottlenecks of autoregressive LLM inference by introducing Parallel Prompt Decoding (PPD), a memory-efficient approach based on trainable prompt tokens that enables parallel multi-token generation without a separate draft model. PPD trains ensemble prompt tokens that account for a tiny fraction of the total parameters and uses a hardware-aware sparse-tree pruning mechanism to adapt to different GPUs, achieving up to 2.49× speedup with negligible runtime memory overhead. The method uses prompt tokens and ensemble embeddings to partially recover conditional dependencies across timesteps, and uses knowledge distillation to align the prompt-tuned behavior with the original LLM. Empirically, PPD performs strongly across MobileLLaMA and Vicuna models on MT-Bench, GSM8K, and HumanEval, and it can be combined with speculative decoding as an orthogonal optimization for further speedups, which makes it attractive for edge deployment and cost-conscious inference.

Abstract

The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these efforts have primarily focused on improving speed metrics such as throughput. Crucially, they often neglect other metrics essential for real-life deployments, such as memory consumption and training cost. To overcome these limitations, we propose a novel parallel prompt decoding (PPD) method that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Inspired by the human natural language generation process, PPD approximates outputs generated at future timesteps in parallel by using multiple prompt tokens. This approach partially recovers the missing conditional dependency information necessary for multi-token generation, resulting in up to a 28% higher acceptance rate for long-range predictions. Furthermore, we present a hardware-aware dynamic sparse tree technique that adaptively optimizes this decoding scheme to fully leverage the computational capacities of different GPUs. Through extensive experiments across LLMs ranging from MobileLLaMA to Vicuna-13B on a wide range of benchmarks, our approach demonstrates up to $2.49\times$ speedup and maintains a minimal runtime memory overhead of just $0.0004$%. More importantly, our parallel prompt decoding can serve as an orthogonal optimization for synergistic integration with existing speculative decoding, showing up to $1.22\times$ further speed improvement. Our code is available at https://github.com/hmarkc/parallel-prompt-decoding.
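The acceptance rate mentioned in the abstract can be made concrete with a minimal sketch of greedy guess-and-verify, the generic scheme used by speculative and parallel decoding methods at temperature 0. This is not the authors' implementation: the function names and the token-ID lists are hypothetical, and `verified[i]` is assumed to be the base model's greedy token at position `i` given the context plus `guessed[:i]`.

```python
from typing import List, Sequence


def accepted_prefix_length(guessed: Sequence[int], verified: Sequence[int]) -> int:
    """Count how many leading guessed tokens match the base model's greedy tokens."""
    n = 0
    for g, v in zip(guessed, verified):
        if g != v:
            break
        n += 1
    return n


def decode_step(guessed: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Tokens emitted by one guess-and-verify step (hypothetical helper).

    `verified` must hold len(guessed) + 1 tokens. The step emits the accepted
    guesses plus one token from the base model: the correction at the first
    mismatch, or the extra next token when every guess is accepted.
    """
    if len(verified) != len(guessed) + 1:
        raise ValueError("verified must contain len(guessed) + 1 tokens")
    n = accepted_prefix_length(guessed, verified)
    return list(guessed[:n]) + [verified[n]]


# Two of three guesses are accepted, then the base model's token replaces the third.
emitted = decode_step([5, 7, 9], [5, 7, 2, 4])
```

Because every emitted token is either a verified guess or the base model's own choice, the output of such a step matches what plain greedy decoding would produce, while emitting more than one token per forward pass whenever guesses are accepted.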


Paper Structure (42 sections, 4 theorems, 3 equations, 14 figures, 13 tables)

This paper contains 42 sections, 4 theorems, 3 equations, 14 figures, 13 tables.

Introduction
Background and Related Work
Speculative Decoding
Parallel Decoding
Parallel Prompt Decoding (PPD)
Prompt Tokens
Ensemble Prompt Tokens
Training
Sparse Tree Pruning
Customized Sparse Tree Attention
Two-Stage Tree Pruning Algorithm
Experiments
Speedup Comparison with Parallel Decoding Methods
Long-range Token Prediction
Memory and Training Efficiency
...and 27 more sections

Key Result

Proposition B.1

For a sparse tree state $T_k$ in which each candidate token $v$ follows a path $\text{Path}(v)$ from the root, and $p_k$ denotes the acceptance probability at path position $k$, the expected number of generated tokens $f(T_k)$ is given by $f(T_k) = \sum_{v \in \text{C}(T_k)} \prod_{i \in \text{Path}(v)} p_i$, where $\text{C}(T_k)$ is the set of candidate tokens in $T_k$.
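The expected-token count in Proposition B.1 can be evaluated directly for a small tree. The sketch below reads the product as running over the acceptance probabilities $p_i$ of the positions on each candidate's path, which is an assumption about the proposition's final term. The tree encoding (a `parent` map and an `accept_prob` map keyed by node name) and the example probabilities are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


def expected_tokens(parent: Dict[str, Optional[str]],
                    accept_prob: Dict[str, float]) -> float:
    """Sum, over all candidate nodes, the product of acceptance
    probabilities along the path from the root to that node.

    `parent[node]` is the parent candidate, or None when the node hangs
    directly off the root (the last verified token, which is not a key).
    """
    total = 0.0
    for node in parent:
        path_prob = 1.0
        cur: Optional[str] = node
        while cur is not None:  # walk up to the root, multiplying probabilities
            path_prob *= accept_prob[cur]
            cur = parent[cur]
        total += path_prob
    return total


# Hypothetical tree: "a" and "b" are first-position candidates, "c" follows "a".
parent = {"a": None, "b": None, "c": "a"}
accept_prob = {"a": 0.6, "b": 0.2, "c": 0.5}
f_value = expected_tokens(parent, accept_prob)  # 0.6 + 0.2 + 0.6 * 0.5
```

For this toy tree the value is approximately 1.1. A pruning procedure could use such a score to compare candidate tree shapes, since adding a node increases the sum only by the probability that its whole path is accepted.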

Figures (14)

Figure 1: Comparison of (a) memory, speedup, and training cost on MT-Bench with Vicuna-7B, where circle diameter represents training GPU hours, and (b) the memory overhead change with sequence length.
Figure 2: Overview of PPD. The left section shows the location of trainable parameters and the middle section displays the combined guess-and-verify process during inference. The "prompt token" denotes the special token with separately trained embeddings to perform parallel prediction.
Figure 3: Illustration of Tree Pruning Pipeline. The tree structure is optimized as a result of pruning.
Figure 4: Comparative evaluation of latency speedup between PPD and other parallel decoding methods. The experiments were conducted using the MT-Bench dataset, with the temperature set to MT-Bench’s default configuration for Medusa and PPD.
Figure 5: Throughput of PPD and vanilla models across different tasks (multi-turn dialogue, coding, and math). The temperature for experiments is set to 0 and the generated output exactly matches that of the original LLM. We do not show the results of Vicuna-13B on RTX 4090 as it does not fit into the GPU memory.
...and 9 more figures

Theorems & Definitions (5)

Definition B.1
Proposition B.1
Proposition B.2
Proposition B.3
Proposition B.4
