Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
Hao Mark Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan
TL;DR
The paper tackles the latency and memory bottlenecks of autoregressive LLM inference by introducing Parallel Prompt Decoding (PPD), a memory-efficient approach based on trainable prompt tokens that enables parallel multi-token generation without a separate draft model. PPD trains ensemble prompt tokens that make up a tiny fraction of the total parameters and employs a hardware-aware sparse-tree pruning mechanism to adapt to different GPUs, achieving up to 2.49× speedup with negligible runtime memory overhead. The method uses prompt tokens and ensemble embeddings to partially recover conditional dependencies across timesteps, and uses knowledge distillation to align the prompt-token predictions with those of the original LLM. Empirically, PPD performs strongly across MobileLLaMA and Vicuna models on MT-Bench, GSM8K, and HumanEval, and it can be combined with speculative decoding as an orthogonal optimization for further speedup, which makes it practical for edge deployment and cost-conscious inference.
Abstract
The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these efforts have primarily focused on improving processing speed, such as throughput. Crucially, they often neglect other metrics essential for real-life deployments, such as memory consumption and training cost. To overcome these limitations, we propose a novel parallel prompt decoding (PPD) scheme that requires only $0.0002\%$ trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Inspired by the human natural language generation process, PPD approximates outputs generated at future timesteps in parallel by using multiple prompt tokens. This approach partially recovers the missing conditional dependency information necessary for multi-token generation, resulting in up to a 28% higher acceptance rate for long-range predictions. Furthermore, we present a hardware-aware dynamic sparse tree technique that adaptively optimizes this decoding scheme to fully leverage the computational capacities of different GPUs. Through extensive experiments across LLMs ranging from MobileLLaMA to Vicuna-13B on a wide range of benchmarks, our approach demonstrates up to $2.49\times$ speedup and maintains a minimal runtime memory overhead of just $0.0004\%$. More importantly, our parallel prompt decoding can serve as an orthogonal optimization for synergistic integration with existing speculative decoding, showing up to $1.22\times$ further speed improvement. Our code is available at https://github.com/hmarkc/parallel-prompt-decoding.
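
To make the notion of an acceptance rate concrete, the following is a minimal sketch of a generic greedy guess-and-verify step, of the kind used in multi-token decoding schemes. It is an illustration under stated assumptions, not the authors' implementation: the function names `accepted_prefix_length` and `decode_step` are hypothetical, tokens are plain integers, and the sketch verifies a single candidate sequence rather than the sparse tree of candidates described in the abstract.

```python
from typing import List, Sequence


def accepted_prefix_length(guessed: Sequence[int], verified: Sequence[int]) -> int:
    """Count the leading guessed tokens that match the base model's greedy choices."""
    count = 0
    for g, v in zip(guessed, verified):
        if g != v:
            break
        count += 1
    return count


def decode_step(guessed: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one verification pass (hypothetical helper).

    Assumption: verified[i] is the token the base model picks at position i
    given the context followed by guessed[:i], so `verified` has exactly one
    more entry than `guessed`.
    """
    if len(verified) != len(guessed) + 1:
        raise ValueError("verified must have one more entry than guessed")
    n = accepted_prefix_length(guessed, verified)
    # Keep the accepted guesses, then append the model's own token at the
    # first mismatch (or the extra token when every guess was accepted).
    return list(guessed[:n]) + [verified[n]]


if __name__ == "__main__":
    # Two of three guesses are accepted, so three tokens are emitted in one pass.
    print(decode_step([5, 7, 9], [5, 7, 2, 4]))
```

Under these assumptions, every pass emits at least one token, so the output matches ordinary greedy decoding; the speedup comes from passes in which several guesses are accepted at once.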
