KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

Jiayi Nie; Haoran Wu; Yao Lai; Zeyu Cao; Cheng Zhang; Binglei Lou; Erwei Wang; Jianyi Cheng; Timothy M. Jones; Robert Mullins; Rika Antonova; Yiren Zhao

TL;DR

KernelCraft is the first benchmark to evaluate an LLM agent's ability to generate and optimize low-level kernels for customized accelerators via a function-calling, feedback-driven workflow. The results demonstrate the potential for reducing the cost of kernel development for accelerator designers and kernel developers.

Abstract

New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft low-level kernels, a time-consuming, laborious, and error-prone process that cannot scale across diverse hardware targets. This prevents emerging hardware platforms from reaching the market efficiently. While prior LLM-based code generation has shown promise in mature GPU ecosystems, it remains unclear whether agentic LLM systems can quickly produce valid and efficient kernels for emerging hardware with new ISAs. We present KernelCraft, the first benchmark to evaluate an LLM agent's ability to generate and optimize low-level kernels for customized accelerators via a function-calling, feedback-driven workflow. Within KernelCraft, the agent refines kernels under ISA and hardware constraints using automated feedback derived from compilation checks, simulation, and correctness validation against ground truth. In our experiments, we assess agent performance across three emerging accelerator platforms on more than 20 ML tasks, each with 5 diverse task configurations, with a dedicated evaluation of task configuration complexity. Across four leading reasoning models, top agents produce functionally valid kernels for previously unseen ISAs within a few refinement steps, with optimized kernels that match or outperform template-based compiler baselines. These results demonstrate the potential for reducing the cost of kernel development for accelerator designers and kernel developers.
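
The feedback-driven workflow described in the abstract (compilation check, simulation, then validation against ground truth, with the result fed back to the agent) can be illustrated with a minimal Python sketch. This is not KernelCraft's actual interface: the names `Feedback`, `memory_diff`, `check_kernel`, and `refine`, the callback signatures, and the absolute tolerance value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Feedback:
    """Hypothetical feedback record returned to the agent after each attempt."""
    ok: bool
    stage: str    # "compile" or "verify" (assumed stage names)
    message: str


def memory_diff(actual: Sequence[float], expected: Sequence[float],
                atol: float = 1e-3) -> List[Tuple[int, float, float]]:
    """List (index, actual, expected) for elements outside an absolute tolerance."""
    return [(i, a, e) for i, (a, e) in enumerate(zip(actual, expected))
            if abs(a - e) > atol]


def check_kernel(kernel_src: str,
                 compile_fn: Callable[[str], List[str]],
                 simulate_fn: Callable[[str], Sequence[float]],
                 expected: Sequence[float],
                 atol: float = 1e-3) -> Feedback:
    """Run a compile check, then simulate and compare against the reference output."""
    errors = compile_fn(kernel_src)          # assumed to return a list of error strings
    if errors:
        return Feedback(False, "compile", "; ".join(errors))
    actual = simulate_fn(kernel_src)         # assumed to return the output memory
    if len(actual) != len(expected):
        return Feedback(False, "verify",
                        f"expected {len(expected)} outputs, got {len(actual)}")
    bad = memory_diff(actual, expected, atol)
    if bad:
        i, a, e = bad[0]
        return Feedback(False, "verify",
                        f"{len(bad)} mismatches; first at index {i}: got {a}, want {e}")
    return Feedback(True, "verify", "all outputs within tolerance")


def refine(propose_fn: Callable[[Optional[Feedback]], str],
           check_fn: Callable[[str], Feedback],
           max_steps: int = 5) -> Tuple[Optional[str], int]:
    """Ask the agent for a kernel, check it, and feed the result back until it passes."""
    feedback: Optional[Feedback] = None
    for step in range(1, max_steps + 1):
        kernel = propose_fn(feedback)        # the agent sees the previous feedback
        feedback = check_fn(kernel)
        if feedback.ok:
            return kernel, step
    return None, max_steps
```

In this sketch `propose_fn` stands in for the LLM agent, and `compile_fn` and `simulate_fn` stand in for the platform toolchain and simulator. The loop stops at the first kernel that passes both checks or after `max_steps` attempts.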

Paper Structure (43 sections, 8 figures, 15 tables)

Introduction
Preliminaries
Hardware kernel
Tool-use-based LLM agents
KernelCraft
Hardware targets and kernel tasks
Hardware targets.
Kernel tasks.
Evaluation
Experiments
Task success rate
Kernel performance
Discussion
Extended reasoning is essential for hard kernel generation tasks.
In-context learning is critical when ISA documentation is scarce.
...and 28 more sections

Figures (8)

Figure 1: Overview of KernelCraft. Generation tasks in KernelCraft span three levels of workloads: primitive operations, composite operations, and end-to-end systems. When using an LLM-based agent for kernel generation, we provide the task description, ISA specification, and hardware configuration as inputs. During generation, the agent can leverage the provided tools for debugging and iterative refinement.
Figure 2: KernelCraft benchmarks an LLM agent for accelerator assembly-kernel generation in a diagnosis-and-repair loop. Starting from workload/ISA/hardware specifications, the agent writes an assembly kernel that is automatically saved and verified by KernelCraft using syntax checks and reference-based functional checks. When mismatches are detected, KernelCraft performs memory-level diff diagnostics to localize possible errors and feeds the signals back to the agent for iterative patching, repeating until the kernel meets correctness criteria (e.g., elementwise numerical tolerance).
Figure 3: Speedup of the best KernelCraft agent's kernels over compiler baselines on representative workloads of varying complexity across three accelerator platforms (PLENA: native compiler, Coral: RVV -O2, AMD: Peano).
Figure 4: Average token usage per workload across four LLMs on PLENA (5 runs each). Bars show per-run averages decomposed into system prompt, input, reasoning (GPT-5.2 and DeepSeek R1 only), and output tokens. Claude Sonnet 4 and Gemini 3 Flash include reasoning tokens within the output token count. Success rates are shown above each bar.
Figure 5: Average token usage per workload across four LLMs on Coral NPU (5 runs each). Bars show per-run averages decomposed into system prompt, input, reasoning (GPT-5.2 and DeepSeek R1 only), and output tokens. Claude Sonnet 4 and Gemini 3 Flash include reasoning tokens within the output token count. Success rates are shown above each bar.
...and 3 more figures
