Accelerating OpenPangu Inference on NPU via Speculative Decoding

Yuntao Dai, Jing Wu, Hang Gu, Teng Wang

TL;DR

This study presents an end-to-end speculative inference acceleration scheme for OpenPangu-7B to mitigate the Memory Wall bottleneck encountered by Large Language Models (LLMs) during inference on NPU hardware.

Abstract

To mitigate the Memory Wall bottleneck that Large Language Models (LLMs) encounter during inference on NPU hardware, and to address the scarcity of native support for mainstream speculative decoding algorithms on domestic infrastructure, this study presents an end-to-end speculative inference acceleration scheme for OpenPangu-7B.

Paper Structure

This paper contains 15 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Micro-architecture of the NPU Processor. (a) The Core separates Matrix (Cube) and Vector operations, necessitating distinct buffer management (L0A/L0B/L0C). (b) The detailed data flow shows that instruction dispatch and memory movement are highly pipelined. This hardware design heavily favors Static Shape execution, where data movement paths are determined at compile-time, contrasting with the dynamic scheduling flexibility of GPUs.
  • Figure 2: Overview of the OpenPangu-with-Medusa architecture adapted for NPU. The system features (a) a frozen OpenPangu backbone with lightweight MLP heads for block-wise token prediction, and (b) a Static Tree Verification module. The latter utilizes pre-computed topological buffers (medusa_attn_mask) to enable zero-copy path retrieval, ensuring compatibility with the NPU's static graph execution model.
  • Figure 3: End-to-End Speedup Comparison (NPU vs. NVIDIA A6000). The proposed method achieves a peak speedup of 1.35$\times$ on NPU for short sequences ($L=128$), significantly outperforming the unoptimized GPU baseline. However, the speedup on NPU exhibits a downward trend as sequence length increases, eventually crossing the break-even point at $L=1024$.
  • Figure 4: Computational Overhead Analysis. The graph illustrates the ratio of speculative decoding time to standard autoregressive time. The NPU (blue line) shows a steeper slope compared to the GPU (red dashed line), indicating higher sensitivity to memory access patterns. This non-linear growth in overhead is the primary factor limiting speedup in long-context scenarios on the NPU.
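The Static Tree Verification idea in the Figure 2 caption can be illustrated with a small sketch. The code below is not the paper's implementation: it is a minimal pure-Python example, assuming a Medusa-style candidate tree given as a fixed list of paths, where each path is a tuple of per-head top-k choices. The function name `build_static_tree`, the path encoding, and the `-1` padding value are hypothetical; only the buffer name `medusa_attn_mask` comes from the caption. Because the tree is fixed ahead of time, both the attention mask and the path-retrieval indices can be built once and reused at every decoding step, which keeps tensor shapes static.

```python
from typing import Dict, List, Tuple

Path = Tuple[int, ...]


def build_static_tree(paths: List[Path]) -> Tuple[List[List[int]], List[List[int]]]:
    """Precompute a tree attention mask and per-leaf retrieval indices.

    Assumption (hypothetical encoding): a path such as (0, 1) means
    "top-1 candidate of head 1, then top-2 candidate of head 2".
    Node 0 is the root, i.e. the token produced by the backbone itself.
    """
    # Sort by depth, then lexicographically, so parents precede children.
    nodes = sorted(set(paths), key=lambda p: (len(p), p))
    if not nodes:
        raise ValueError("at least one candidate path is required")

    # Map every path (including the empty root path) to a node index.
    index: Dict[Path, int] = {(): 0}
    for i, p in enumerate(nodes, start=1):
        index[p] = i

    # Every non-root node must have its parent in the tree.
    for p in nodes:
        if p[:-1] not in index:
            raise ValueError(f"path {p} is missing its parent {p[:-1]}")

    # mask[i][j] == 1 means node i may attend to node j.
    # Each node sees itself and its ancestors, never its siblings.
    n = len(index)
    mask = [[0] * n for _ in range(n)]
    for p, i in index.items():
        for depth in range(len(p) + 1):
            mask[i][index[p[:depth]]] = 1

    # One row of node indices per root-to-leaf path, padded with -1
    # so that every row has the same (static) length.
    parents = {p[:-1] for p in nodes}
    leaves = [p for p in nodes if p not in parents]
    width = max(len(p) for p in nodes) + 1
    retrieve = []
    for leaf in leaves:
        row = [index[leaf[:depth]] for depth in range(len(leaf) + 1)]
        retrieve.append(row + [-1] * (width - len(row)))
    return mask, retrieve


# Example: two candidates from head 1, two continuations of the first one.
medusa_attn_mask, retrieve_indices = build_static_tree(
    [(0,), (1,), (0, 0), (0, 1)]
)
```

In this sketch, verification would gather the backbone's logits along each row of `retrieve_indices` and accept the longest prefix that matches. Since both buffers depend only on the tree topology, they never change shape between steps, which is the property the caption associates with the NPU's static graph execution model.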