Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

TL;DR

We propose hidden transfer, a novel parallel decoding approach that decodes multiple successive tokens simultaneously in a single forward pass. In terms of acceleration metrics, it outperforms all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.

Abstract

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, their large parameter counts cause significant latency during inference. This is particularly evident with autoregressive decoding, which generates only one token per forward pass and therefore does not fully exploit the parallel computing capabilities of GPUs. In this paper, we propose a novel parallel decoding approach, hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. The idea is to transfer the intermediate hidden states of the previous context to pseudo hidden states of the future tokens to be generated. The pseudo hidden states then pass through the subsequent transformer layers, assimilating more semantic information and achieving better prediction accuracy for the future tokens. In addition, we use the tree attention mechanism to generate and verify multiple candidate output sequences simultaneously, which ensures lossless generation and further improves the generation efficiency of our method. Experiments demonstrate the effectiveness of our method, and a series of analytical experiments supports our motivation. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
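
The transfer step described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each transfer step is a learned linear map applied to the hidden state of the last context token at a chosen intermediate layer. The names `matvec`, `hidden_transfer`, and `transfer_weights`, and the toy 2-dimensional values, are hypothetical and are not taken from the paper.

```python
def matvec(weight, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def hidden_transfer(hidden_states, transfer_weights):
    """Append one pseudo hidden state per future token.

    hidden_states: per-token hidden states at the chosen intermediate layer.
    transfer_weights: one matrix per transfer step (assumed linear here).
    """
    last = hidden_states[-1]
    pseudo = [matvec(weight, last) for weight in transfer_weights]
    # In a real model, the extended sequence would now pass through the
    # remaining transformer layers together with the true context states.
    return hidden_states + pseudo


# Toy example: a 2-token context and 2 transfer steps.
context = [[1.0, 0.0], [0.5, 2.0]]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # step 1: identity map (toy values)
    [[0.0, 1.0], [1.0, 0.0]],  # step 2: swaps the two coordinates (toy values)
]
extended = hidden_transfer(context, weights)
print(len(extended))   # 4 positions: 2 real + 2 pseudo
print(extended[2:])    # [[0.5, 2.0], [2.0, 0.5]]
```

The point of the sketch is only the data flow: the pseudo states are created at an intermediate layer, so the layers above it can still refine them using attention over the real context.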

Paper Structure (20 sections, 4 equations, 6 figures, 1 table)

Figures (6)

  • Figure 2: Overview of our method. The upper left shows the training process of hidden transfer, where $N$ and $M$ denote numbers of transformer layers. The upper right and the bottom show the inference processes of hidden transfer and Medusa, respectively. Both methods are assumed to start from the same context and to take as input the candidate token sequences generated in the previous round. Each method verifies the candidate token sequences to find the last accepted token position (i.e., the fourth token of the input) and, at the same time, generates the next token and new draft tokens (for simplicity, we consider only 2 transfer steps/Medusa heads). The next token and the draft tokens form a tree structure, which is flattened into a sequence to be verified with tree attention. After the verification stage, hidden transfer shows higher prediction accuracy and has more draft tokens accepted.
  • Figure 3: TopK token prediction accuracy of three prediction methods on the LLaMA-2-Chat-13B model: training separate lm-heads directly on some intermediate layers (denoted as Early exit in the figure), the Medusa method, and our hidden transfer method (we transfer the pseudo intermediate hidden states of the next 3 tokens at the $25^{\text{th}}$, $30^{\text{th}}$, and $35^{\text{th}}$ layers, respectively). $N$ in the figure is the prediction step ($N=2$ means we predict the first draft token). Our method achieves the best prediction accuracy.
  • Figure 4: Similarity between the predicted virtual hidden states and the original hidden states.
  • Figure 5: Prediction accuracy of the first transfer step on different layers for Vicuna-7B and LLaMA-2-Chat-7B. TopK means the topk tokens predicted by the transfer step include
  • Figure 6: Prediction accuracy of the second transfer step on different layers for Vicuna-7B and LLaMA-2-Chat-7B with the fixed transfer step 15. TopK means the topk tokens predicted by the transfer step include
  • ...and 1 more figures
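
The tree verification described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes greedy decoding, represents the flattened tree by a hypothetical `parents` array (with `-1` for the root), and uses made-up tokens. The helper names `tree_attention_mask` and `longest_accepted_path` are likewise hypothetical.

```python
def tree_attention_mask(parents):
    """Return mask[i][j] = True if flattened node i may attend to node j.

    parents[i] is the index of node i's parent, or -1 for the root.
    Each node attends only to itself and its ancestors, so sibling
    candidates do not see each other.
    """
    size = len(parents)
    mask = [[False] * size for _ in range(size)]
    for node in range(size):
        current = node
        while current != -1:
            mask[node][current] = True
            current = parents[current]
    return mask


def longest_accepted_path(parents, draft_tokens, verified_next):
    """Greedy verification: starting at the root, keep following a child
    whose draft token equals the token the model predicts after its parent."""
    path = [0]
    node = 0
    extended = True
    while extended:
        extended = False
        for child, parent in enumerate(parents):
            if parent == node and draft_tokens[child] == verified_next[node]:
                path.append(child)
                node = child
                extended = True
                break
    return path


# Toy tree: node 0 is the next token; nodes 1-2 are candidates for the
# first draft position; nodes 3-4 are candidates following node 1.
parents = [-1, 0, 0, 1, 1]
draft_tokens = ["the", "cat", "dog", "sat", "ran"]
# Token the model itself predicts after each node (made-up values).
verified_next = ["cat", "ran", "barked", "on", "away"]

mask = tree_attention_mask(parents)
path = longest_accepted_path(parents, draft_tokens, verified_next)
print(mask[4])  # [True, True, False, False, True]
print(path)     # [0, 1, 4]
```

Because a draft token is accepted only when it matches what the model would have produced itself, the accepted path reproduces the output of ordinary greedy decoding, which is the sense in which this kind of verification is lossless.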