K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model

Shiyi Cao; Ziming Mao; Joseph E. Gonzalez; Ion Stoica

TL;DR

K-Search replaces static search heuristics with a co-evolving world model that explicitly decouples high-level algorithmic planning from low-level program instantiation. This lets the search navigate non-monotonic optimization paths while remaining resilient to temporary implementation defects, and it outperforms state-of-the-art evolutionary search methods by an average of 2.10x (up to 14.3x on complex MoE kernels).

Abstract

Optimizing GPU kernels is critical for efficient modern machine learning systems yet remains challenging due to the complex interplay of design factors and rapid hardware evolution. Existing automated approaches typically treat Large Language Models (LLMs) merely as stochastic code generators within heuristic-guided evolutionary loops. These methods often struggle with complex kernels requiring coordinated, multi-step structural transformations, as they lack explicit planning capabilities and frequently discard promising strategies due to inefficient or incorrect intermediate implementations. To address this, we propose Search via Co-Evolving World Model and build K-Search based on this method. By replacing static search heuristics with a co-evolving world model, our framework leverages LLMs' prior domain knowledge to guide the search, actively exploring the optimization space. This approach explicitly decouples high-level algorithmic planning from low-level program instantiation, enabling the system to navigate non-monotonic optimization paths while remaining resilient to temporary implementation defects. We evaluate K-Search on diverse, complex kernels from FlashInfer, including GQA, MLA, and MoE kernels. Our results show that K-Search significantly outperforms state-of-the-art evolutionary search methods, achieving an average 2.10x improvement and up to a 14.3x gain on complex MoE kernels. On the GPUMode TriMul task, K-Search achieves state-of-the-art performance on H100, reaching 1030 µs and surpassing both prior evolutionary-search and human-designed solutions.

Paper Structure (44 sections, 6 equations, 4 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Fast iteration and specialized kernel libraries.
Compiler autotuning, DSLs, and kernel optimization.
LLMs for GPU kernel generation.
LLM-guided evolutionary and population-based program search.
Large Language Models as World Models
K-Search
Problem Setup
Search via Co-Evolving World Model
Baseline: Heuristic Search in Program Space.
Ours: Search via Co-Evolving World Model.
System Design
Search State.
Execution: Local Refinement.
...and 29 more sections

Figures (4)

Figure 1: Overview of K-Search. The framework operates on a Search State $S_t$ structured as a search tree. The tree consists of Closed nodes (blue, visited states with an attached program such as $x_{12}$) and a Frontier of Open nodes (orange, pending hypotheses such as $u_{13}$). The workflow iterates through three phases: (1) Action Selection, where the most promising action node is retrieved from the frontier based on the priority score $V$ estimated by the world model; (2) Local Refinement, where a stochastic policy $\pi_{\text{code}}$ samples concrete implementations until stagnation; and (3) World Model Update, where the LLM reasons over the trajectory to update the search tree via Insert (adding new actions), Update (adjusting $V$, e.g., $u_{11}$ dropping from 0.9 to 0.6), and Prune (removing less promising nodes such as $u_{10}$).
Figure 2: K-Search Search Trace Visualization. It tracks the evolution of the Search State across search rounds on the MLA Paged Decode kernel (refer to the experiments section for setup details). A round corresponds to one candidate program evaluation. Nodes represent actions (blue=Closed, orange=Open), annotated with their instantiated program performance (closed nodes) or priority scores (open nodes). The timeline highlights how the kernel is improved and how the LLM dynamically Inserts new hypotheses, Updates beliefs, and Prunes less promising branches based on evolved understanding.
Figure 3: Main Results (3 runs each). (a) compares the best-so-far scores of the kernels generated by the three systems across 120 iterations. (b) provides a per-workload analysis for all compared systems. (c) shows the fraction of workloads for which the best kernel from each system achieves the specified speedup over the FlashInfer baseline.
Figure : K-Search: Search via Co-Evolving World Models
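
The three-phase workflow described in the Figure 1 caption (Action Selection, Local Refinement, World Model Update) can be sketched as a small loop. This is only an illustrative toy under stated assumptions, not the authors' implementation: the names `Node`, `select_action`, `local_refinement`, `search`, `toy_sample_program`, and `toy_world_model_update` are hypothetical, the LLM-backed code policy and world model are replaced by deterministic stand-ins, scores are assumed to be "higher is better", stagnation is assumed to mean a fixed number of consecutive non-improving samples, and the node `u_14` is invented for the demo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    priority: float                 # V: priority estimated by the world model
    score: Optional[float] = None   # best measured score once the node is closed


def select_action(frontier):
    """Phase 1: pick the open node with the highest priority V."""
    return max(frontier, key=lambda n: n.priority)


def local_refinement(node, sample_program, patience=3):
    """Phase 2: sample implementations until `patience` consecutive
    samples fail to improve on the best score (assumed stagnation rule)."""
    best, stale = None, 0
    while stale < patience:
        score = sample_program(node)
        if best is None or score > best:
            best, stale = score, 0
        else:
            stale += 1
    return best


def search(frontier, sample_program, world_model_update, steps):
    """Run `steps` iterations of select -> refine -> world-model update."""
    closed = []
    for _ in range(steps):
        if not frontier:
            break
        node = select_action(frontier)
        frontier.remove(node)
        node.score = local_refinement(node, sample_program)
        closed.append(node)
        # Phase 3: the world model proposes Insert / Update / Prune edits.
        inserts, updates, prunes = world_model_update(closed, frontier)
        frontier.extend(inserts)
        for n in frontier:
            if n.name in updates:
                n.priority = updates[n.name]
        frontier[:] = [n for n in frontier if n.name not in prunes]
    return max(closed, key=lambda n: n.score)


# --- Toy stand-ins for the LLM components (hypothetical values) ---
TOY_SCORES = {"u_13": 1.2, "u_14": 1.8, "u_11": 1.5}


def toy_sample_program(node):
    # A real system would generate and benchmark a kernel here.
    return TOY_SCORES[node.name]


def toy_world_model_update(closed, frontier):
    # After the first closed node: insert u_14, lower u_11's priority
    # from 0.9 to 0.6, and prune u_10. Afterwards, make no changes.
    if len(closed) == 1:
        return [Node("u_14", 0.7)], {"u_11": 0.6}, {"u_10"}
    return [], {}, set()


frontier = [Node("u_13", 0.95), Node("u_11", 0.9), Node("u_10", 0.2)]
best = search(frontier, toy_sample_program, toy_world_model_update, steps=2)
print(best.name, best.score, [n.name for n in frontier])
```

In this toy run the search first closes `u_13`, then the stand-in world model reshapes the frontier so that the newly inserted `u_14` is tried before the demoted `u_11`. Keeping priorities on open nodes separate from measured scores on closed nodes is what lets a hypothesis stay alive even when one sampled implementation of it performs poorly.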
