Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis

Yujie Zheng, Zhuo Li, Shengtao Zhang, Hanjing Wang, Junjie Sheng, Jiaqian Wang, Junchi Yan, Weinan Zhang, Ying Wen, Bo Tang, Muning Wen

TL;DR

EvoKernel is a self-evolving agentic framework that automates the lifecycle of kernel synthesis, from initial drafting to continual refining. It demonstrates that value-guided experience accumulation allows general-purpose models to master kernel synthesis on niche hardware ecosystems.

Abstract

Deploying Large Language Models to data-scarce programming domains poses significant challenges, particularly for kernel synthesis on emerging Domain-Specific Architectures where a "Data Wall" limits available training data. While models excel on data-rich platforms like CUDA, they suffer catastrophic performance drops on data-scarce ecosystems such as NPU programming. To overcome this cold-start barrier without expensive fine-tuning, we introduce EvoKernel, a self-evolving agentic framework that automates the lifecycle of kernel synthesis from initial drafting to continual refining. EvoKernel addresses this by formulating the synthesis process as a memory-based reinforcement learning task. Through a novel value-driven retrieval mechanism, it learns stage-specific Q-values that prioritize experiences based on their contribution to the current objective, whether bootstrapping a feasible draft or iteratively refining latency. Furthermore, by enabling cross-task memory sharing, the agent generalizes insights from simple to complex operators. By building an NPU variant of KernelBench and evaluating on it, EvoKernel improves frontier models' correctness from 11.0% to 83.0% and achieves a median speedup of 3.60x over initial drafts through iterative refinement. This demonstrates that value-guided experience accumulation allows general-purpose models to master the kernel synthesis task on niche hardware ecosystems. Our official page is available at https://evokernel.zhuo.li.

Paper Structure (39 sections, 5 theorems, 11 equations, 8 figures, 7 tables)

Key Result

Lemma 1

Suppose $|R_t| \le R_{\max}$ for all $t$ almost surely, and $\alpha_t \in (0, 1]$. If $Q_0 \in [-R_{\max}, R_{\max}]$, then $Q_t \in [-R_{\max}, R_{\max}]$ for all $t$.
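The lemma can be checked numerically with a short simulation. This summary does not state the update rule, so the sketch below assumes the standard incremental form $Q_{t+1} = (1 - \alpha_t) Q_t + \alpha_t R_t$; the function names `q_update` and `simulate` are hypothetical. Under that assumption each new value is a convex combination of two numbers in $[-R_{\max}, R_{\max}]$, so it stays in the same interval.

```python
import random

def q_update(q, reward, alpha):
    """One incremental step, Q <- (1 - alpha) * Q + alpha * R.

    The update form is an assumption; the summary does not state it.
    """
    return (1.0 - alpha) * q + alpha * reward

def simulate(q0, r_max, steps, seed=0):
    """Apply random bounded rewards and random step sizes in (0, 1]."""
    rng = random.Random(seed)
    q = q0
    trajectory = [q]
    for _ in range(steps):
        reward = rng.uniform(-r_max, r_max)   # |R_t| <= R_max
        alpha = 1.0 - rng.random()            # rng.random() is in [0, 1), so alpha is in (0, 1]
        q = q_update(q, reward, alpha)
        trajectory.append(q)
    return trajectory

R_MAX = 1.0
traj = simulate(q0=-1.0, r_max=R_MAX, steps=10_000)
# A small tolerance absorbs floating-point rounding at the boundary.
assert all(abs(q) <= R_MAX + 1e-12 for q in traj)
```

A simulation is only a sanity check, not a proof; the induction argument is the convex-combination observation above.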

Figures (8)

  • Figure 1: The EvoKernel framework. (Left) Cold-Start Drafting: Given task batch $\mathcal{X}$, retrieves top-$k$ candidates, filters context via $Q$, and synthesizes an initial kernel. (Center) Environment & Memory: A multi-gate verifier assesses generated code to yield rewards, which update $Q$ via value iteration; code and results are stored in Memory. (Right) Continual Refining: Exploits generation traces $\mathcal{P}(x)$ and historical attempts, including observable child nodes, to iteratively optimize for lower latency.
  • Figure 2: Optimization outcomes. (Left) Category-level correctness and speedup distribution at budget $T{=}30$; color segments show the fraction of correct kernels in each speedup tier relative to Torch-NPU. (Right) Within-operator speedup achieved by iterative refinement across 159 operators with $\geq$1 valid optimization candidate beyond the initial correct draft; inset panels detail representative optimization trajectories.
  • Figure 3: Transfer and generalization. (Left) Transfer across difficulty levels: cumulative success rate on L2 under different stream compositions. (Right) Transfer across generator backbones: performance on held-out operators when reusing memory built with GPT-5.2.
  • Figure 4: mHC Kernels (Ascend): Optimization timeline and performance vs. Torch-NPU baseline for 15 DeepSeek mHC operators over 30 iterations (merged across three experiment series). (Left) Correctness and performance optimization timeline. (Right) Best correct run vs. baseline in log$_2$ speedup.
  • Figure 5: Retrieval ablations. (Left) Value-driven vs. heuristic retrieval on L2 operators (same L1 memory and $\epsilon$-greedy schedule). (Right) Effect of increasing retrieval pool size $K$ at iteration 24; cumulative correctness and compilation rates on L1 operators.
  • ...and 3 more figures
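The retrieval loop described in the captions for Figures 1 and 5 (a candidate pool of size $K$, filtering by $Q$, an $\epsilon$-greedy schedule, and reward-driven updates) might look roughly like the sketch below. It is an illustration under assumptions, not the authors' implementation: the `Experience` fields, the similarity score, the selection count, and the incremental update with step size `alpha` are all hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Experience:
    key: str           # hypothetical identifier of a stored attempt
    similarity: float  # hypothetical relevance score to the current task
    q: float = 0.0     # learned value estimate for this experience

def retrieve(memory, pool_size, n_select, epsilon, rng):
    """Two-stage retrieval: similarity pool, then epsilon-greedy choice by Q."""
    # Stage 1: keep the pool_size most similar experiences.
    pool = sorted(memory, key=lambda e: e.similarity, reverse=True)[:pool_size]
    # Stage 2: pick n_select items, exploring with probability epsilon.
    remaining = list(pool)
    selected = []
    while remaining and len(selected) < n_select:
        if rng.random() < epsilon:
            pick = rng.choice(remaining)               # explore
        else:
            pick = max(remaining, key=lambda e: e.q)   # exploit highest Q
        remaining.remove(pick)
        selected.append(pick)
    return selected

def update_values(selected, reward, alpha=0.1):
    """Move each used experience's Q toward the observed reward (assumed rule)."""
    for e in selected:
        e.q += alpha * (reward - e.q)

rng = random.Random(0)
memory = [Experience(f"exp{i}", similarity=1.0 - 0.1 * i) for i in range(6)]
memory[3].q = 0.9  # pretend this experience helped before
chosen = retrieve(memory, pool_size=4, n_select=2, epsilon=0.0, rng=rng)
update_values(chosen, reward=1.0)
```

With `epsilon=0.0` the selection is purely greedy, so the experience with the highest stored value is picked first even though three others in the pool are more similar. That is the difference from similarity-only retrieval this sketch is meant to show.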

Theorems & Definitions (13)

  • Lemma 1: Bounded Rewards Imply Bounded Values
  • proof
  • Corollary 2: Boundedness of Raw Optimization Reward
  • proof
  • Remark 3: Z-Score Normalization Requires Safeguards
  • Remark 4: Error Clipping Alone Is Insufficient
  • Lemma 5: Convergence of Running Statistics
  • proof
  • Remark 6: Relation to PopArt
  • Lemma 7: Constant Step Size: EMA Dynamics
  • ...and 3 more
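The titles above mention z-score normalization, safeguards, and running statistics, but this summary does not give the details. As a rough illustration only, the sketch below tracks a running mean and variance with Welford's algorithm and applies two generic safeguards: a floor on the standard deviation and a clip on the normalized value. The class name `RunningNormalizer`, the parameters `min_std` and `clip`, and the choice of safeguards are assumptions, not the paper's method.

```python
import math

class RunningNormalizer:
    """Running z-score normalizer (hypothetical sketch)."""

    def __init__(self, min_std=1e-3, clip=5.0):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0           # sum of squared deviations from the mean
        self.min_std = min_std  # floor that avoids dividing by ~0
        self.clip = clip        # bound on the normalized output

    def update(self, x):
        """Welford's online update of mean and squared deviations."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std(self):
        """Population standard deviation, floored at min_std."""
        if self.count < 2:
            return self.min_std
        return max(math.sqrt(self.m2 / self.count), self.min_std)

    def normalize(self, x):
        """Z-score of x, clipped to [-clip, clip]."""
        z = (x - self.mean) / self.std()
        return max(-self.clip, min(self.clip, z))

norm = RunningNormalizer()
for reward in [2.0, 2.0, 2.0]:   # identical rewards give zero variance
    norm.update(reward)
bounded = norm.normalize(3.0)    # without the floor this would divide by zero
```

The constant-reward case shows why a raw z-score is fragile: the variance is zero, so the floor and the clip are what keep the normalized value finite and bounded.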