MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?

Xingze Zou; Jing Wang; Yuhua Zheng; Xueyi Chen; Haolei Bai; Lingcheng Kong; Syed A. R. Abu-Bakar; Zhaode Wang; Chengfei Lv; Haoji Hu; Huan Wang

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code generation, yet their potential for generating kernels specifically for mobile devices remains largely unexplored. In this work, we extend the scope of automated kernel generation to the mobile domain to investigate the central question: Can LLMs write efficient kernels for mobile devices? To enable systematic investigation, we introduce MobileKernelBench, a comprehensive evaluation framework comprising a benchmark prioritizing operator diversity and cross-framework interoperability, coupled with an automated pipeline that bridges the host-device gap for on-device verification. Leveraging this framework, we conduct extensive evaluation on the CPU backend of Mobile Neural Network (MNN), revealing that current LLMs struggle with the engineering complexity and data scarcity inherent to mobile frameworks; standard models and even fine-tuned variants exhibit high compilation failure rates (over 54%) and negligible performance gains due to hallucinations and a lack of domain-specific grounding. To overcome these limitations, we propose the Mobile Kernel Agent (MoKA), a multi-agent system equipped with repository-aware reasoning and a plan-and-execute paradigm. Validated on MobileKernelBench, MoKA achieves state-of-the-art performance, boosting compilation success to 93.7% and enabling 27.4% of generated kernels to deliver measurable speedups over native libraries.

Paper Structure (24 sections, 2 equations, 7 figures, 6 tables)

Introduction
Related Work
LLM for Kernel Generation
Mobile Inference Engine
MobileKernelBench
Data Curation
Evaluation Pipeline
MoKA
Agent Collaboration Design
Agentic Toolset
Experiments
Experiment Setups
Baseline Evaluation
MoKA Results
Conclusion
...and 9 more sections

Figures (7)

Figure 1: Performance evaluation on MobileKernelBench across three metrics: compilation success rate (CSR), functional correctness rate (FCR), and performance speedup (fast$_p$) (Ouyang et al., 2025). (a) Baseline LLM performance: We benchmark prevalent open- and closed-source LLMs, revealing significant shortcomings in their ability to generate functional and efficient mobile kernels. (b) Method comparison: We compare our proposed MoKA against common training methods, including LoRA and GRPO. The red circle (marked at 50%) corresponds to the outer limit of plot (a), highlighting that MoKA achieves substantial improvements, surpassing the performance ceiling of both baseline models and naive fine-tuning approaches.
Figure 2: Overview of our proposed framework. The system consists of two core components: (a) MobileKernelBench, which establishes the evaluation environment by integrating a target-driven data curation process with an automated, hardware-in-the-loop evaluation pipeline; and (b) MoKA, a multi-role agentic system where Coder, Debugger, and Accelerator agents collaborate to iteratively generate and refine kernels based on feedback from the benchmark.
Figure 3: Success rate degradation across evaluation stages. The plot illustrates the performance drop of each model as the evaluation criteria become stricter, from compilation to functional correctness and varying levels of performance optimization.
Figure 4: Performance comparison of SOTA LLMs on MobileKernelBench. The best and second-best results are highlighted in bold and underlined, respectively. Some model names are abbreviated for layout purposes, with full identifiers provided in \ref{sec:baseline_result}.
Figure 4: Fine-grained performance of LLMs across different operator categories. We visualize the evaluation metrics for five representative operator types. The results highlight significant disparities in model capabilities when handling operators with varying levels of algorithmic complexity.
...and 2 more figures
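
The three metrics named in the Figure 1 caption can be illustrated with a minimal Python sketch. It assumes that all three rates are taken over the full set of benchmark tasks, and that fast$_p$ follows the KernelBench convention: the fraction of tasks whose generated kernel is both correct and more than $p$ times faster than the baseline. The `KernelResult` record and the function names are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class KernelResult:
    """Outcome for one generated kernel (hypothetical record layout)."""
    compiled: bool                   # did the kernel build?
    correct: bool                    # did its output match the reference?
    speedup: Optional[float] = None  # baseline_time / generated_time, if measured


def _require_nonempty(results: Sequence[KernelResult]) -> None:
    if not results:
        raise ValueError("results must contain at least one task")


def compilation_success_rate(results: Sequence[KernelResult]) -> float:
    """CSR: fraction of tasks whose kernel compiled (assumed denominator: all tasks)."""
    _require_nonempty(results)
    return sum(1 for r in results if r.compiled) / len(results)


def functional_correctness_rate(results: Sequence[KernelResult]) -> float:
    """FCR: fraction of tasks whose kernel compiled and produced correct output."""
    _require_nonempty(results)
    return sum(1 for r in results if r.compiled and r.correct) / len(results)


def fast_p(results: Sequence[KernelResult], p: float) -> float:
    """fast_p: fraction of tasks that are correct AND have speedup strictly above p."""
    _require_nonempty(results)
    hits = sum(
        1
        for r in results
        if r.compiled and r.correct and r.speedup is not None and r.speedup > p
    )
    return hits / len(results)


if __name__ == "__main__":
    demo = [
        KernelResult(compiled=True, correct=True, speedup=1.5),
        KernelResult(compiled=True, correct=True, speedup=0.8),
        KernelResult(compiled=True, correct=False),
        KernelResult(compiled=False, correct=False),
    ]
    print("CSR:", compilation_success_rate(demo))
    print("FCR:", functional_correctness_rate(demo))
    print("fast_1:", fast_p(demo, 1.0))
```

Under these assumptions the metrics are nested: a kernel counted by fast$_p$ is also counted by FCR, and a kernel counted by FCR is also counted by CSR. That nesting is consistent with the stage-by-stage drop in success rate described in the Figure 3 caption.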
