EAGLE-Pangu: Accelerator-Safe Tree Speculative Decoding on Ascend NPUs

Chang Han; Yijie Hu; Jingling Liu

TL;DR

EAGLE-Pangu is a reproducible system that ports EAGLE-3-style tree speculative decoding to a Pangu teacher backend on Ascend NPUs. It provides an explicit branch/commit cache manager built on the Cache API and a fused-kernel-compatible teacher verification path with a debuggable eager fallback.

Abstract

Autoregressive decoding remains a primary bottleneck in large language model (LLM) serving, motivating speculative decoding methods that reduce expensive teacher-model invocations by verifying multiple candidate tokens per step. Tree-structured speculation further increases parallelism, but is often brittle when ported across heterogeneous backends and accelerator stacks, where attention masking, KV-cache layouts, and indexing semantics are not interchangeable. We present EAGLE-Pangu, a reproducible system that ports EAGLE-3-style tree speculative decoding to a Pangu teacher backend on Ascend NPUs. EAGLE-Pangu contributes (i) an explicit branch/commit cache manager built on the Cache API, (ii) accelerator-safe tree tensorization that removes undefined negative indices by construction and validates structural invariants, and (iii) a fused-kernel-compatible teacher verification path with a debuggable eager fallback. On 240 turns from MT-Bench and HumanEval-style prompts, EAGLE-Pangu improves end-to-end decoding throughput by 1.27x on average, up to 2.46x at p99, over teacher-only greedy decoding in the fused-kernel performance path. We also provide a fused-kernel-free reference path with structured traces and invariant checks to support reproducible debugging and ablation across execution modes and tree budgets.
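
The idea behind contribution (ii), removing negative indices by construction, can be illustrated with a small sketch. The abstract does not specify the tensor layout, so everything below is an assumption for illustration only: the function names `shift_parents` and `ancestor_mask` are hypothetical, the tree is given as a plain parent list with -1 meaning "no parent", and pure Python lists stand in for accelerator tensors. The sketch places a dummy root at slot 0 so that every parent index is a valid non-negative gather index, checks one structural invariant (a parent must precede its child), and derives the ancestor-only attention mask used in tree verification.

```python
from typing import List


def shift_parents(parents: List[int]) -> List[int]:
    """Re-index a draft tree so that a dummy root occupies slot 0.

    `parents[i]` is the parent of node i, with -1 marking "no parent"
    (an assumed input convention). In the output, node i lives at slot
    i + 1 and former roots point at slot 0, so no index is negative.
    """
    shifted = [0]  # the dummy root points at itself
    for i, p in enumerate(parents):
        # Structural invariant: a parent is -1 or an earlier node.
        if not -1 <= p < i:
            raise ValueError(f"node {i}: parent {p} must be -1 or an earlier node")
        shifted.append(p + 1)
    return shifted


def ancestor_mask(shifted: List[int]) -> List[List[bool]]:
    """Build a tree attention mask from dummy-root parent indices.

    mask[i][j] is True when slot j is slot i itself or one of its
    ancestors (including the dummy root at slot 0).
    """
    n = len(shifted)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while True:
            mask[i][j] = True
            if j == 0:
                break
            j = shifted[j]  # always smaller than j, so the walk terminates
    return mask


if __name__ == "__main__":
    # Draft tree: node 0 is a root, nodes 1 and 2 are its children,
    # node 3 is a child of node 1.
    shifted = shift_parents([-1, 0, 0, 1])
    for row in ancestor_mask(shifted):
        print("".join("1" if allowed else "." for allowed in row))
```

Because every entry of `shifted` lies in the range 0 to `len(shifted) - 1`, a gather over it never depends on how a backend treats negative indices, which is the portability concern the abstract raises. A real implementation would presumably operate on batched device tensors rather than Python lists.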

Paper Structure (47 sections, 14 equations, 15 figures, 3 tables)

Introduction
Background
Auto-regressive decoding and KV caching
Speculative decoding
Tree-structured speculative decoding
Tree attention masking
Tensor indexing and gather semantics in tree decoding
Problem setting and evaluation perspective
Method
Branchable KV-cache abstraction
Commit by path indices and prefix-sharing fast reorder.
Accelerator-safe tree tensor semantics
Node linearization and base arrays.
Dummy-root indexing (sentinel-free gathers).
Ancestor tables for path-structured operations.
...and 32 more sections

Figures (15)

Figure 3: Position-wise acceptance (accept_pos) aggregated over the evaluation set. Later draft positions are harder to accept, explaining the long-tail behavior of $L_k$.
Figure 6: Throughput speedup under drafter-only fixed-window truncation. Smaller windows reduce $L_k$ and degrade end-to-end speed.
Figure 7: Draft attention evidence (instrumented; analysis-only): the top-1 attention location frequently lies in far history ($\texttt{256\_plus}$ bucket), consistent with truncation harming draft quality and acceptance.
Figure : Prompt length distribution.
Figure : (a) Speedup distribution.
...and 10 more figures
