Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis

Yujie Zheng; Zhuo Li; Shengtao Zhang; Hanjing Wang; Junjie Sheng; Jiaqian Wang; Junchi Yan; Weinan Zhang; Ying Wen; Bo Tang; Muning Wen

TL;DR

EvoKernel is a self-evolving agentic framework that automates the kernel synthesis lifecycle, from initial drafting to continual refining. It demonstrates that value-guided experience accumulation allows general-purpose models to master kernel synthesis on niche hardware ecosystems.

Abstract

Deploying Large Language Models to data-scarce programming domains poses significant challenges, particularly for kernel synthesis on emerging Domain-Specific Architectures, where a "Data Wall" limits available training data. While models excel on data-rich platforms like CUDA, they suffer catastrophic performance drops on data-scarce ecosystems such as NPU programming. To overcome this cold-start barrier without expensive fine-tuning, we introduce EvoKernel, a self-evolving agentic framework that automates the lifecycle of kernel synthesis from initial drafting to continual refining. EvoKernel formulates the synthesis process as a memory-based reinforcement learning task. Through a novel value-driven retrieval mechanism, it learns stage-specific Q-values that prioritize experiences based on their contribution to the current objective, whether that is bootstrapping a feasible draft or iteratively reducing latency. Furthermore, by enabling cross-task memory sharing, the agent generalizes insights from simple to complex operators. On an NPU variant of KernelBench that we build for evaluation, EvoKernel improves frontier models' correctness from 11.0% to 83.0% and achieves a median speedup of 3.60x over initial drafts through iterative refinement. This demonstrates that value-guided experience accumulation allows general-purpose models to master the kernel synthesis task on niche hardware ecosystems. Our official page is available at https://evokernel.zhuo.li.
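The value-driven retrieval idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Experience` class, the stage labels `"draft"` and `"refine"`, the precomputed `similarity` score, and the functions `retrieve` and `update_q` are all hypothetical names chosen for illustration. The sketch assumes a two-step retrieval (a similarity-based candidate pool, then ranking by a stage-specific Q-value with epsilon-greedy exploration) and an incremental Q update with step size `alpha`.

```python
import random
from dataclasses import dataclass, field

STAGES = ("draft", "refine")  # hypothetical stage labels, not from the paper


@dataclass
class Experience:
    key: str            # e.g. an operator name (illustrative)
    similarity: float   # stand-in for similarity to the current task
    # One Q-value per stage, so the same memory item can be valued
    # differently for drafting and for refining.
    q: dict = field(default_factory=lambda: {s: 0.0 for s in STAGES})


def retrieve(memory, stage, pool_size, n_context, epsilon, rng):
    """Pick context items: similarity pool first, then Q-based selection."""
    # Step 1: candidate pool of the most similar experiences.
    pool = sorted(memory, key=lambda e: e.similarity, reverse=True)[:pool_size]
    # Step 2: epsilon-greedy over the pool (assumed exploration scheme).
    if rng.random() < epsilon:
        return rng.sample(pool, min(n_context, len(pool)))
    return sorted(pool, key=lambda e: e.q[stage], reverse=True)[:n_context]


def update_q(experiences, stage, reward, alpha):
    """Move each used experience's stage Q-value toward the observed reward."""
    for e in experiences:
        e.q[stage] += alpha * (reward - e.q[stage])
```

In use, a caller would invoke `retrieve(memory, "draft", ...)` before generating a kernel, obtain a reward from the verifier, and then call `update_q` on the retrieved items so that experiences which helped the current stage are ranked higher next time.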

Paper Structure (39 sections, 5 theorems, 11 equations, 8 figures, 7 tables)

Introduction
Related Work
EvoKernel: Value-Driven Memory Update for Kernel Evolution
Problem Formulation
Memory Architecture and Value-Driven Retrieval
Stage 1: Cold-Start Drafting
Stage 2: Continual Refining
Multi-gate Verification
Experiment
Experimental Setup
Main Results
Generalization of Value-Driven Memory
Beyond KernelBench and CANN
Ablations
Value-Driven versus Heuristic-Driven Retrieval
...and 24 more sections

Key Result

Lemma 1

Suppose $|R_t| \le R_{\max}$ for all $t$ almost surely, and $\alpha_t \in (0, 1]$. If $Q_0 \in [-R_{\max}, R_{\max}]$, then $Q_t \in [-R_{\max}, R_{\max}]$ for all $t$.
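A short induction sketch shows why the bound propagates. It assumes the values follow the incremental update $Q_{t+1} = (1-\alpha_t)\,Q_t + \alpha_t R_t$; this update form is an assumption here, suggested by the EMA dynamics named in Lemma 7, and the paper's own proof may be stated differently. If $|Q_t| \le R_{\max}$, then

$$
\begin{aligned}
|Q_{t+1}| &\le (1-\alpha_t)\,|Q_t| + \alpha_t\,|R_t| \\
&\le (1-\alpha_t)\,R_{\max} + \alpha_t\,R_{\max} = R_{\max}.
\end{aligned}
$$

The first inequality is the triangle inequality, valid because $\alpha_t \in (0, 1]$ makes both weights $1-\alpha_t$ and $\alpha_t$ nonnegative. The second uses the induction hypothesis and the reward bound. The base case is the assumption $Q_0 \in [-R_{\max}, R_{\max}]$.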

Figures (8)

Figure 1: The EvoKernel framework. (Left) Cold-Start Drafting: Given a task batch $\mathcal{X}$, the agent retrieves top-$k$ candidates, filters the context via $Q$, and synthesizes an initial kernel. (Center) Environment & Memory: A multi-gate verifier assesses the generated code to yield rewards, which update $Q$ via value iteration; code and results are stored in Memory. (Right) Continual Refining: The agent exploits generation traces $\mathcal{P}(x)$ and historical attempts, including observable child nodes, to iteratively optimize for lower latency.
Figure 2: Optimization outcomes. (Left) Category-level correctness and speedup distribution at budget $T{=}30$; color segments show the fraction of correct kernels in each speedup tier relative to Torch-NPU. (Right) Within-operator speedup achieved by iterative refinement across 159 operators with $\geq$1 valid optimization candidate beyond the initial correct draft; inset panels detail representative optimization trajectories.
Figure 3: Transfer and generalization. (Left) Transfer across difficulty levels: cumulative success rate on L2 under different stream compositions. (Right) Transfer across generator backbones: performance on held-out operators when reusing memory built with GPT-5.2.
Figure 4: mHC Kernels (Ascend): Optimization timeline and performance vs. Torch-NPU baseline for 15 DeepSeek mHC operators over 30 iterations (merged across three experiment series). (Left) Correctness and performance optimization timeline. (Right) Best correct run vs. baseline in log$_2$ speedup.
Figure 5: Retrieval ablations. (Left) Value-driven vs. heuristic retrieval on L2 operators (same L1 memory and $\epsilon$-greedy schedule). (Right) Effect of increasing retrieval pool size $K$ at iteration 24; cumulative correctness and compilation rates on L1 operators.
...and 3 more figures

Theorems & Definitions (13)

Lemma 1: Bounded Rewards Imply Bounded Values
proof
Corollary 2: Boundedness of Raw Optimization Reward
proof
Remark 3: Z-Score Normalization Requires Safeguards
Remark 4: Error Clipping Alone Is Insufficient
Lemma 5: Convergence of Running Statistics
proof
Remark 6: Relation to PopArt
Lemma 7: Constant Step Size: EMA Dynamics
...and 3 more
