Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang

TL;DR

Kangaroo tackles the memory-bound bottleneck of autoregressive LLM inference without requiring a large, separately trained draft model. It forms a self-draft model from a fixed shallow sub-network of the target LLM plus a lightweight adapter, and it applies early exiting twice: once to exit from the shallow layers, and once to stop drafting when the draft model's confidence in the current token falls below a threshold. Under single-sequence verification, the approach reaches speedups of up to $1.68\times$ on Spec-Bench with 88.7% fewer additional parameters than Medusa-1 (67M vs. 591M), and it demonstrates that dynamic drafting further improves end-to-end performance. The work delivers a practical, parameter-efficient pathway to accelerate inference in large language models without sacrificing sampling fidelity.

Abstract

Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models while maintaining a consistent sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly. Drawing inspiration from early exiting, we propose a novel self-speculative decoding framework, Kangaroo, which uses a fixed shallow sub-network as a self-draft model, with the remaining layers serving as the larger target model. We train a lightweight and efficient adapter module on top of the sub-network to bridge the gap between the representation ability of the sub-network and that of the full model. Notably, the inference latency of the self-draft model may no longer be negligible compared to the large model, necessitating strategies to increase the token acceptance rate while minimizing the drafting steps of the small model. To address this challenge, we introduce an additional early exiting mechanism for generating draft tokens. Specifically, we halt the small model's subsequent prediction during the drafting phase once the confidence level for the current token falls below a certain threshold. Extensive experiments on Spec-Bench demonstrate the effectiveness of Kangaroo. Under single-sequence verification, Kangaroo achieves speedups up to $1.68\times$ on Spec-Bench, outperforming Medusa-1 with 88.7% fewer additional parameters (67M compared to 591M). The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo.
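
The drafting rule described in the abstract (keep drafting until the confidence for the current token falls below a threshold, then hand the drafted tokens to the target model) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the Kangaroo code base: `draft_with_early_exit`, `verify_greedy`, `toy_draft`, and `toy_target` are toy stand-ins, verification is written as a sequential loop for readability even though the method verifies drafted tokens in parallel, and greedy token matching is assumed in place of sampling-based acceptance.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a draft function returns (next token, top-1 confidence),
# a target function returns the next token the full model would choose greedily.
DraftFn = Callable[[List[int]], Tuple[int, float]]
TargetFn = Callable[[List[int]], int]


def draft_with_early_exit(prefix: List[int], draft_fn: DraftFn,
                          eta: float, max_steps: int) -> List[int]:
    """Draft tokens until the confidence drops to eta or below, or max_steps is hit."""
    drafted: List[int] = []
    context = list(prefix)
    for _ in range(max_steps):
        token, confidence = draft_fn(context)
        drafted.append(token)      # the low-confidence token is still verified later
        context.append(token)
        if confidence <= eta:      # stop spending draft steps on a hard token
            break
    return drafted


def verify_greedy(prefix: List[int], drafted: List[int],
                  target_fn: TargetFn) -> List[int]:
    """Accept the longest matching run of drafted tokens, then add one target token.

    Assumption: greedy matching, written sequentially for clarity.
    """
    accepted: List[int] = []
    context = list(prefix)
    for token in drafted:
        expected = target_fn(context)
        if token != expected:
            accepted.append(expected)   # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_fn(context))  # all accepted: target supplies the next token
    return accepted


def toy_target(context: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> Tuple[int, float]:
    """Toy draft model: same prediction, but unsure after tokens 2, 5, 8."""
    token = (context[-1] + 1) % 10
    confidence = 0.4 if context[-1] % 3 == 2 else 0.9
    return token, confidence


drafted = draft_with_early_exit([0], toy_draft, eta=0.6, max_steps=6)
accepted = verify_greedy([0], drafted, toy_target)
print(drafted, accepted)
```

In this toy run the draft model stops after three tokens because the third one is produced with low confidence, and the target accepts all three and contributes a fourth. The threshold `eta` and the step cap `max_steps` are the two knobs that trade drafting cost against accepted length.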

Paper Structure (17 sections, 4 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Comparison of various self-drafting speculative decoding methods on Spec-Bench (Xia et al., 2024) for Vicuna-7B (Chiang et al., 2023). Kangaroo outperforms all other methods w.r.t. end-to-end speedup ratio across all four subtasks. For a more detailed comparison on the full Spec-Bench, see Table \ref{tab:benchmark}.
  • Figure 2: The framework of Kangaroo. The adapter network $\mathcal{A}$ consists of only one multi-head attention layer (Vaswani et al., 2017) and two normalization layers (Zhang and Sennrich, 2019). The self-draft model $\mathcal{M}^s = \mathcal{A} \circ \mathcal{M}^b[:l]$ reuses the LM head of the target LLM $\mathcal{M}^b$, where $l$ denotes the early-exit layer. To avoid unnecessary cost on more difficult tokens, $\mathcal{M}^s$ stops drafting once the confidence level of the current token falls below a certain threshold, e.g., $\mathcal{M}^s(x_3^{\prime}) \le \eta$. Note that the next early feature $f_3$ of the stopped token is concatenated with all previously exited features into a parallel compute unit $[f_0, f_1, \cdots, f_3]$, which is verified by the remaining layers $\mathcal{M}^b[l:]$ in parallel. Once all drafted tokens are accepted ($x_i^{\prime} = x_i$ for $i = 1, 2, 3$), the next round can start with $x_4$; had $f_3$ not been computed in advance, it would have to start with $x_3$. Decoding on the parallel compute unit $[f_3, f_4]$ could save the latency of a single forward pass of the adapter network $\mathcal{A}$.
  • Figure 3: Ablation studies on hyper-parameters. The compression rate and walltime speedup are averaged across all sub-benchmarks in Spec-Bench.
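
The layer split in the Figure 2 caption, with $\mathcal{M}^s = \mathcal{A} \circ \mathcal{M}^b[:l]$ sharing the target's LM head and $\mathcal{M}^b[l:]$ verifying the cached early features, can be sketched with a toy model. This is an illustration under strong assumptions rather than the paper's implementation: each "layer" is a scalar function applied per position (a real transformer layer attends across positions), and `layers`, `adapter`, `lm_head`, and the helper names are all hypothetical.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def run_layers(stack: Sequence[Layer], h: float) -> float:
    """Apply a stack of toy layers in order."""
    for layer in stack:
        h = layer(h)
    return h


# Hypothetical 4-layer target model M^b with early-exit layer l = 2.
layers: List[Layer] = [
    lambda h: h + 1.0,
    lambda h: h * 2.0,
    lambda h: h - 3.0,
    lambda h: h * 0.5,
]
l = 2


def adapter(f: float) -> float:
    """Hypothetical adapter A: a cheap, imperfect stand-in for layers[l:]."""
    return f * 0.5 - 1.0


def lm_head(h: float) -> int:
    """Toy LM head shared by the draft and target models."""
    return int(h) % 10


def early_feature(x: float) -> float:
    """f = M^b[:l](x): shallow features, computed once and cached."""
    return run_layers(layers[:l], x)


def draft_token(f: float) -> int:
    """Self-draft prediction: LM head applied to A(f)."""
    return lm_head(adapter(f))


def verify_token(f: float) -> int:
    """Target prediction from a cached feature: only layers[l:] are run."""
    return lm_head(run_layers(layers[l:], f))


def target_token(x: float) -> int:
    """Target prediction from scratch: all layers are run."""
    return lm_head(run_layers(layers, x))


inputs = [0.0, 1.0, 2.0, 3.0]
features = [early_feature(x) for x in inputs]    # the compute unit [f_0, ..., f_3]
drafts = [draft_token(f) for f in features]      # cheap guesses from the adapter
verified = [verify_token(f) for f in features]   # remaining layers on cached features
print(drafts, verified)
```

The property the sketch is meant to show is that `verify_token(early_feature(x))` always equals `target_token(x)`: the shallow layers are computed once during drafting, so verification only pays for the remaining layers.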

Theorems & Definitions (1)

  • Definition 1