Graph-Structured Speculative Decoding

Zhuocheng Gong; Jiahao Liu; Ziyue Wang; Pengfei Wu; Jingang Wang; Xunliang Cai; Dongyan Zhao; Rui Yan

TL;DR

This work targets the efficiency bottleneck in decoding large language models by improving speculative decoding. It introduces Graph-structured Speculative Decoding (GSD), a DAG-based token-graph framework that shares tokens across multiple drafted hypotheses to reduce draft-time computation. By merging redundant nodes and pruning, GSD increases the acceptance rate of drafted tokens while keeping the computational budget in check, achieving speedups of up to $1.96\times$ on LLaMA-2-70b with minimal quality loss. The approach offers faster, more scalable LLM inference across multiple model families and tasks.

Abstract

Speculative decoding has emerged as a promising technique for accelerating the inference of Large Language Models (LLMs): a small language model drafts a hypothesis sequence, which the LLM then validates. The effectiveness of this approach relies heavily on the balance between the performance and the efficiency of the draft model. In our research, we focus on increasing the proportion of draft tokens that are accepted into the final output by generating multiple hypotheses instead of just one. This gives the LLM more options to choose from, so it can select the longest sequence that meets its standards. Our analysis reveals that hypotheses produced by the draft model share many common token sequences, suggesting a potential for optimizing computation. Leveraging this observation, we introduce an approach that uses a directed acyclic graph (DAG) to manage the drafted hypotheses. This structure enables us to efficiently predict and merge recurring token sequences, substantially reducing the computational demands of the draft model. We term this approach Graph-structured Speculative Decoding (GSD). We apply GSD across a range of LLMs, including a 70-billion-parameter LLaMA-2 model, and observe speedups of 1.73$\times$ to 1.96$\times$, significantly surpassing standard speculative decoding.
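
To make the idea of merging recurring token sequences concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical redundancy rule: two draft nodes are merged when their last `n` tokens are identical. The function names, the rule itself, and the toy hypotheses are all illustrative assumptions. The sketch compares the number of nodes in a token tree (one node per distinct prefix) with the number of nodes left after merging, and checks that the merged structure is acyclic.

```python
from typing import List, Set, Tuple

# A node key is (is_short_prefix, tokens). All names here are hypothetical.
Key = Tuple[bool, Tuple[str, ...]]


def tree_nodes(hypotheses: List[List[str]]) -> Set[Tuple[str, ...]]:
    """Token tree: one node per distinct non-empty prefix of any hypothesis."""
    return {tuple(h[:i]) for h in hypotheses for i in range(1, len(h) + 1)}


def merge_key(prefix: Tuple[str, ...], n: int) -> Key:
    """Assumed rule: nodes whose last n tokens match share one graph node.
    Prefixes shorter than n keep their full prefix and are never merged."""
    if len(prefix) >= n:
        return (False, prefix[-n:])
    return (True, prefix)


def build_token_graph(hypotheses: List[List[str]], n: int):
    """Collapse the token tree into a graph using merge_key."""
    nodes: Set[Key] = set()
    edges: Set[Tuple[Key, Key]] = set()
    for prefix in tree_nodes(hypotheses):
        child = merge_key(prefix, n)
        nodes.add(child)
        if len(prefix) > 1:
            edges.add((merge_key(prefix[:-1], n), child))
    return nodes, edges


def is_acyclic(nodes: Set[Key], edges: Set[Tuple[Key, Key]]) -> bool:
    """Kahn's algorithm: the graph is a DAG iff every node can be removed."""
    indeg = {v: 0 for v in nodes}
    for _, dst in edges:
        indeg[dst] += 1
    ready = [v for v, d in indeg.items() if d == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for src, dst in edges:
            if src == v:
                indeg[dst] -= 1
                if indeg[dst] == 0:
                    ready.append(dst)
    return removed == len(nodes)


# Toy hypotheses (made up for illustration) that share long token runs.
hyps = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "sat", "on", "a", "mat"],
]
nodes, edges = build_token_graph(hyps, n=2)
print(len(tree_nodes(hyps)), len(nodes))  # 14 10
```

In this toy case the tree holds 14 draft nodes while the merged graph holds 10, so the draft model would need to process fewer positions. The sketch does not guarantee acyclicity in general: if an n-gram repeats inside a single hypothesis, this simple rule can create a cycle, which is why the check is included. A merged node also no longer records which prefix led to it, so merging trades some context fidelity for reduced computation.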

Paper Structure (34 sections, 1 equation, 8 figures, 10 tables)

Introduction
Related Works
LLM Compression
LLM Decoding Acceleration
Preliminaries: Sequence-structured Speculative Decoding
A Step Forward: Tree-structured Speculative Decoding
Parallelized drafting and verifying via tree attention
Pruning inferior branches
Graph-structured Speculative Decoding
Same tokens re-occur among hypotheses
Identifying redundant nodes
Merging redundant nodes
How does node merging hurt the performance?
Token graph verification
Experiments
...and 19 more sections

Figures (8)

Figure 1: An illustrative comparison between the tree- and graph-structured draft token management.
Figure 2: Overview of our method. (Left) GSD advances beyond TSD and SSD by implementing pruning strategies along with a re-occurring node merging technique. (Right) An illustration demonstrates the process by which the token tree (or graph) is flattened to a sequence. The sequence is then paired with a customized attention mask designed to uphold the proper dependencies between tokens to perform efficient drafting and verifying.
Figure 3: The proportion of tokens that are part of re-occurring n-grams within the token tree where the maximum out-degree $k$ is 4. $\theta_{prob} = 0.2$ and $\theta_{sib}=0.3$.
Figure 4: An illustration of how the token graph operates during the draft stage and the verification stage.
Figure 5: A series of ablation studies to investigate the hyperparameter configuration of maximum out-degree, redundant threshold, and two pruning techniques. All other hyperparameters adhere to the configuration described in Section \ref{ap:cfg}.
...and 3 more figures
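
The flattening step described in the Figure 2 caption can be illustrated with a small sketch of a tree attention mask. This is a generic construction under stated assumptions, not code from the paper: the tree is given as a hypothetical `parent` list in which `parent[i]` is the index of node `i`'s parent in the flattened sequence (`-1` for a node attached directly to the already-accepted prefix), and each token may attend only to itself and its ancestors. The function name and the example tree are illustrative.

```python
from typing import List


def tree_attention_mask(parent: List[int]) -> List[List[bool]]:
    """mask[i][j] is True iff flattened token i may attend to token j,
    i.e. j is i itself or one of i's ancestors in the token tree."""
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking every ancestor
            mask[i][j] = True
            j = parent[j]
    return mask


# Flattened toy tree (made up for illustration):
#   the -> cat -> sat
#   the -> dog -> ran
tokens = ["the", "cat", "dog", "sat", "ran"]
parent = [-1, 0, 0, 1, 2]
mask = tree_attention_mask(parent)

for tok, row in zip(tokens, mask):
    print(f"{tok:>4}", "".join("x" if m else "." for m in row))
```

With this mask, "sat" attends to "the" and "cat" but not to "dog", so both branches can be scored in a single forward pass without leaking context between them. The sketch covers the tree case only; in a token graph a merged node has more than one parent, and how its dependencies are masked is left outside this example.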
