AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization

Jiehao Wu, Zixiao Huang, Wenhao Li, Chuyun Shen, Junjie Sheng, Xiangfeng Wang

Abstract

AscendC (Ascend C) operator optimization on Huawei Ascend neural processing units (NPUs) faces a two-fold knowledge bottleneck: unlike the CUDA ecosystem, there are few public reference implementations to learn from, and performance hinges on a coupled two-part artifact - a host-side tiling program that orchestrates data movement and a kernel program that schedules and pipelines instructions. We present AscendOptimizer, an episodic agent that bootstraps this missing expertise by turning execution into experience. On the host side, AscendOptimizer performs profiling-in-the-loop evolutionary search to discover valid and high-performing tiling and data-movement configurations directly from hardware feedback. On the kernel side, it mines transferable optimization motifs by rewinding optimized kernels - systematically de-optimizing them to synthesize instructive "bad-to-good" trajectories - and distills these motifs into a retrievable experience bank for guided rewriting. By alternating host tuning and kernel rewriting in a closed loop, AscendOptimizer steadily expands feasibility and pushes latency down. On a benchmark of 127 real AscendC operators, AscendOptimizer achieves a 1.19x geometric-mean speedup over the open-source baseline, with 49.61% of operators running faster than their references, and it surpasses strong agent and search baselines.
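The two headline numbers in the abstract, a geometric-mean speedup and the share of operators that beat their references, can be illustrated with a minimal sketch. It assumes that per-operator speedup is defined as baseline latency divided by optimized latency; the function names and the sample latencies are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def geometric_mean_speedup(baseline: Sequence[float],
                           optimized: Sequence[float]) -> float:
    """Geometric mean of per-operator speedups.

    Assumption: speedup = baseline latency / optimized latency.
    """
    ratios = [b / o for b, o in zip(baseline, optimized)]
    # Average in log space, then map back with exp.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def fraction_faster(baseline: Sequence[float],
                    optimized: Sequence[float]) -> float:
    """Share of operators whose optimized latency is strictly lower."""
    wins = sum(1 for b, o in zip(baseline, optimized) if o < b)
    return wins / len(baseline)


# Hypothetical latencies (arbitrary units) for two operators.
baseline_latency = [4.0, 1.0]
optimized_latency = [1.0, 1.0]

gm = geometric_mean_speedup(baseline_latency, optimized_latency)
share = fraction_faster(baseline_latency, optimized_latency)
```

The geometric mean is the usual choice for averaging ratios because a single large speedup cannot dominate it the way it would dominate an arithmetic mean. As a consistency check on the reported figures, 63 of 127 operators is approximately 49.61%, which matches the 63 optimized operators shown in Figure 2.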

Paper Structure

This paper contains 34 sections, 3 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of AscendOptimizer. Stage (i) performs evolutionary-guided program search with hardware-in-the-loop profiling feedback to discover valid high-performance configurations; Stage (ii) bootstraps optimization experience via optimization rewind and applies retrieval-augmented kernel optimization to address structural bottlenecks. The two stages are executed in an alternating loop, where improvements from one stage feed into the other for progressive end-to-end optimization.
  • Figure 2: CDF of per-operator speedups achieved by AscendOptimizer on 63 optimized operators. The x-axis is the speedup over the baseline and the y-axis is the cumulative fraction of operators. Dashed markers highlight the corresponding tail ratios: 39.7% of operators achieve at least $1.1\times$, 30.2% achieve at least $1.2\times$, 19.0% achieve at least $1.5\times$, and 14.3% achieve at least $2.0\times$.
  • Figure 3: Semantic landscape of optimization strategies via embedding clustering. Each optimization record (Title & Description) is embedded using an embedding model, projected to 2D for visualization with PCA, and clustered with K-Means. Grey/light regions denote clusters aligned with categories described in the official documentation, while red regions denote clusters that do not directly correspond to the documentation’s explicit taxonomy.
  • Figure 4: Optimization trajectory of the "foreach_pow_scalar_and_tensor" operator.
  • Figure 5: Illustration of a key scheduling rewrite in foreach_pow_scalar_and_tensor: (a) the original remainder-based per-core quota assignment; (b) the optimized block-level load balancing with a nested scan across tensors (changes highlighted in red).
  • ...and 2 more figures
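
The scheduling contrast named in the Figure 5 caption can be sketched in host-side Python. This is an interpretation of the caption only, not the paper's code: it assumes that variant (a) hands out whole tensors to cores by a remainder-based quota and that variant (b) cuts every tensor into fixed-size blocks, divides the global block range across cores, and uses a nested scan over tensors and blocks to map each core's range back to data. All function names, the block size, and the tensor sizes are hypothetical.

```python
from typing import List, Sequence, Tuple


def remainder_quota(num_items: int, num_cores: int) -> List[Tuple[int, int]]:
    """Contiguous [start, end) range per core; the first
    num_items % num_cores cores take one extra item."""
    base, rem = divmod(num_items, num_cores)
    ranges, start = [], 0
    for core in range(num_cores):
        count = base + (1 if core < rem else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def per_tensor_loads(tensor_sizes: Sequence[int], num_cores: int) -> List[int]:
    """(a) Assumed original scheme: whole tensors are assigned by quota.
    Returns the number of elements each core must process."""
    return [sum(tensor_sizes[s:e])
            for s, e in remainder_quota(len(tensor_sizes), num_cores)]


def block_level_loads(tensor_sizes: Sequence[int], num_cores: int,
                      block_size: int) -> List[int]:
    """(b) Assumed optimized scheme: balance at block granularity.
    Returns the number of elements each core must process."""
    blocks_per_tensor = [-(-n // block_size) for n in tensor_sizes]  # ceil div
    loads = []
    for start, end in remainder_quota(sum(blocks_per_tensor), num_cores):
        load, first_block = 0, 0
        # Outer scan: walk tensors to find those overlapping this core's range.
        for size, nblocks in zip(tensor_sizes, blocks_per_tensor):
            lo = max(start, first_block)
            hi = min(end, first_block + nblocks)
            # Inner scan: walk the blocks of this tensor owned by the core.
            for b in range(lo, hi):
                local = b - first_block
                load += min(block_size, size - local * block_size)
            first_block += nblocks
        loads.append(load)
    return loads


# Hypothetical workload: one large tensor and three small ones on 4 cores.
sizes = [1024, 64, 64, 64]
loads_a = per_tensor_loads(sizes, num_cores=4)
loads_b = block_level_loads(sizes, num_cores=4, block_size=64)
```

Under these assumptions, the per-tensor scheme leaves one core with 1024 elements while the others handle 64 each, whereas the block-level scheme caps the busiest core at 320 elements. Both variants process the same total amount of data, so any gain comes purely from evening out the per-core load.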