EAGLE-Pangu: Accelerator-Safe Tree Speculative Decoding on Ascend NPUs

Chang Han, Yijie Hu, Jingling Liu

TL;DR

EAGLE-Pangu is a reproducible system that ports EAGLE-3-style tree speculative decoding to a Pangu teacher backend on Ascend NPUs. It provides an explicit branch/commit cache manager built on the Cache API and a fused-kernel-compatible teacher verification path with a debuggable eager fallback.

Abstract

Autoregressive decoding remains a primary bottleneck in large language model (LLM) serving, motivating speculative decoding methods that reduce expensive teacher-model invocations by verifying multiple candidate tokens per step. Tree-structured speculation further increases parallelism, but is often brittle when ported across heterogeneous backends and accelerator stacks, where attention masking, KV-cache layouts, and indexing semantics are not interchangeable. We present EAGLE-Pangu, a reproducible system that ports EAGLE-3-style tree speculative decoding to a Pangu teacher backend on Ascend NPUs. EAGLE-Pangu contributes (i) an explicit branch/commit cache manager built on the Cache API, (ii) accelerator-safe tree tensorization that removes undefined negative indices by construction and validates structural invariants, and (iii) a fused-kernel-compatible teacher verification path with a debuggable eager fallback. On 240 turns from MT-Bench and HumanEval-style prompts, EAGLE-Pangu improves end-to-end decoding throughput by 1.27x on average, up to 2.46x at p99, over teacher-only greedy decoding in the fused-kernel performance path. We also provide a fused-kernel-free reference path with structured traces and invariant checks to support reproducible debugging and ablation across execution modes and tree budgets.
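
To make contribution (ii) concrete, the following is a minimal sketch of what "removing negative indices by construction" could look like. It is not the authors' implementation: the function name `tensorize_tree`, the convention that the root points to itself, and the choice of invariants are all assumptions made for illustration. A naive tree encoding marks the root's parent as -1, which is unsafe to feed into gather-style indexing on an accelerator. Here the root's parent is remapped to index 0, and the ancestor mask used for tree attention is built while checking that every parent precedes its child.

```python
from typing import Dict, List


def tensorize_tree(parents: List[int]) -> Dict[str, list]:
    """Convert a draft tree given as a parent list into index-safe arrays.

    parents[i] is the parent of node i; node 0 is the root and uses -1 in
    the naive encoding. All names and conventions here are hypothetical.
    """
    n = len(parents)
    if n == 0 or parents[0] != -1:
        raise ValueError("node 0 must be the root with parent -1")

    safe_parents = [0] * n          # root points to itself, never to -1
    depths = [0] * n
    # mask[i][j] is True when node i may attend to node j (j is i or an ancestor of i)
    mask = [[False] * n for _ in range(n)]
    mask[0][0] = True

    for i in range(1, n):
        p = parents[i]
        # Structural invariant: nodes are in topological order.
        if not 0 <= p < i:
            raise ValueError(f"node {i} has invalid parent {p}")
        safe_parents[i] = p
        depths[i] = depths[p] + 1
        mask[i] = list(mask[p])     # inherit the parent's ancestor set
        mask[i][i] = True           # every node attends to itself

    return {"parents": safe_parents, "depths": depths, "mask": mask}


tree = tensorize_tree([-1, 0, 0, 1, 1, 2])
print(tree["parents"])  # no entry is negative
print(tree["depths"])
```

Because the parent array contains only valid indices, it can be used directly for indexing without a special case for the root, and the validation step fails early on malformed trees instead of silently reading out-of-range memory.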

Paper Structure (47 sections, 14 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 3: Position-wise acceptance (accept_pos) aggregated over the evaluation set. Later draft positions are harder to accept, explaining the long-tail behavior of $L_k$.
  • Figure 6: Throughput speedup under drafter-only fixed-window truncation. Smaller windows reduce $L_k$ and degrade end-to-end speed.
  • Figure 7: Draft attention evidence (instrumented; analysis-only): the top-1 attention location frequently lies in far history ($\texttt{256\_plus}$ bucket), consistent with truncation harming draft quality and acceptance.
  • Figure: Prompt length distribution.
  • Figure: (a) Speedup distribution.
  • ...and 10 more figures
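
The relationship between the accepted length $L_k$ and position-wise acceptance in the Figure 3 caption can be illustrated with a small sketch. The exact definition of accept_pos is not given in this excerpt, so the code assumes that draft position $i$ (0-indexed) counts as accepted in step $k$ whenever $L_k > i$. The function name `accept_pos_rates` and the sample data are hypothetical.

```python
from typing import List


def accept_pos_rates(accepted_lengths: List[int], max_depth: int) -> List[float]:
    """Fraction of verification steps in which draft position i was accepted.

    Assumption: position i is accepted in step k exactly when L_k > i.
    """
    if not accepted_lengths:
        raise ValueError("need at least one verification step")
    n = len(accepted_lengths)
    return [
        sum(1 for length in accepted_lengths if length > i) / n
        for i in range(max_depth)
    ]


# Made-up accepted lengths for six verification steps.
rates = accept_pos_rates([0, 1, 1, 2, 3, 1], max_depth=3)
print(rates)
```

Under this assumption the rates are non-increasing in the position index, which matches the caption's observation that later draft positions are harder to accept.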