Rethinking Kernel Program Repair: Benchmarking and Enhancing LLMs with RGym
Kareem Shehada, Yifan Wu, Wyatt D. Feng, Adithya Iyer, Gryphon Kumfert, Yangruibo Ding, Zhiyun Qian
TL;DR
This work introduces RGym, a lightweight framework for evaluating LLM-driven automated program repair (APR) for the Linux kernel. It runs on local hardware, avoiding cloud dependencies and unrealistic oracle assumptions. By combining bug-inducing commit (BIC) localization, call-stack information, and function-wise patching, the RGym-based pipeline achieves a 43.36% pass rate with GPT-5 Thinking at roughly $0.18–$0.20 per bug, and an aggregated pass rate of about 68.5% across configurations at about $1.33 per bug. An ablation study measures the contributions of localization, prompt structure, and model choice, and shows that feedback-based retries significantly boost success. The findings suggest that simpler, locality-driven APR pipelines can match or exceed the performance of more complex, cloud-based systems such as CrashFixer while substantially reducing cost and infrastructure demands, making kernel APR research and development more accessible.
Abstract
Large Language Models (LLMs) have revolutionized automated program repair (APR), but current benchmarks such as SWE-Bench focus predominantly on userspace applications and overlook the complexities of kernel-space debugging and repair. The Linux kernel poses unique challenges due to its monolithic structure, concurrency, and low-level hardware interactions. Prior efforts such as KGym and CrashFixer have highlighted the difficulty of APR in this domain, either reporting low success rates or relying on complex pipelines and costly cloud infrastructure. In this work, we introduce RGym, a lightweight, platform-agnostic APR evaluation framework for the Linux kernel designed to run on local commodity hardware. Building on RGym, we propose a simple yet effective APR pipeline that uses specialized localization techniques (e.g., call stacks and blamed commits) to avoid the unrealistic use of oracles in KGym. We evaluate on a filtered and verified dataset of 143 bugs. Our method achieves a pass rate of up to 43.36% with GPT-5 Thinking at a cost of under $0.20 per bug. We further conduct an ablation study to analyze the contributions of our proposed localization strategy, prompt structure, and model choice, and demonstrate that feedback-based retries can significantly improve success rates.
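To make the reported figures concrete, the sketch below converts the percentages into bug counts and rough total costs. It assumes that every pass rate is computed over the full 143-bug dataset and that the per-bug costs are averages over that same set; neither assumption is stated explicitly in the text above, and the helper names are illustrative only.

```python
# Assumption: pass rates are fractions of the full 143-bug dataset.
DATASET_SIZE = 143


def pass_rate(fixed: int, total: int = DATASET_SIZE) -> float:
    """Return the pass rate in percent, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * fixed / total, 2)


def bugs_for_rate(rate_percent: float, total: int = DATASET_SIZE) -> int:
    """Return the whole number of bugs closest to the given pass rate."""
    return round(rate_percent / 100.0 * total)


def total_cost(cost_per_bug: float, total: int = DATASET_SIZE) -> float:
    """Return the cost of one run over the dataset, assuming an average per-bug cost."""
    return round(cost_per_bug * total, 2)


single_config_fixed = bugs_for_rate(43.36)  # best single configuration
aggregated_fixed = bugs_for_rate(68.5)      # aggregated across configurations

print(single_config_fixed, pass_rate(single_config_fixed))
print(aggregated_fixed, pass_rate(aggregated_fixed))
print(total_cost(0.20), total_cost(1.33))
```

Under these assumptions, 43.36% corresponds to 62 of 143 bugs and about 68.5% to 98 of 143, while a full pass over the dataset would cost at most about $28.60 for the single configuration and about $190 for the aggregated setting.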
