Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs

Zhenhao Zhou, Zhuochen Huang, Yike He, Chong Wang, Jiajun Wang, Yijian Wu, Xin Peng, Yiling Lou

TL;DR

This work targets fault localization (FL) in the Linux kernel, a notably challenging large-scale software system. It first introduces LinuxFLBench, a 250-task benchmark derived from real-world Linux kernel bugs, to evaluate LLM-based FL methods under realistic conditions. The study finds that state-of-the-art LLM agents underperform on the kernel: the best top-1 file-level accuracy is only 41.6%, a substantial drop compared to general software benchmarks. To address these challenges, the authors propose LinuxFL$^+$, an enhancement framework that combines Directory-Aware Expansion and Potential Cause Expansion (Direct and Mail-Augmented) with a Candidate Integration step, yielding improvements of 7.2–11.2 percentage points across agents at minimal cost. Overall, the work highlights kernel-specific FL hurdles and offers a practical, generalizable approach to improving bug localization, with the dataset and code openly available for further research.

Abstract

The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on benchmarks like SWE-bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large-scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, an FL benchmark constructed from real-world Linux kernel bugs. We conduct an empirical study to assess the performance of state-of-the-art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top-1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL$^+$, an enhancement framework designed to improve the FL effectiveness of LLM agents for the Linux kernel. LinuxFL$^+$ substantially improves the FL accuracy of all studied agents (e.g., 7.2%–11.2% accuracy increase) at minimal cost. Data and code are available at https://github.com/FudanSELab/LinuxFLBench.
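
To make the reported metric concrete, the following is a minimal sketch of how file-level top-k accuracy is commonly computed for fault localization. It assumes that a bug counts as localized when at least one of its truly buggy files appears among the first k ranked candidates; the paper's exact definition may differ. The function name `top_k_accuracy`, the bug identifiers, and the rankings are hypothetical illustrations, not taken from LinuxFLBench.

```python
from typing import Dict, List, Set


def top_k_accuracy(
    predictions: Dict[str, List[str]],
    ground_truth: Dict[str, Set[str]],
    k: int = 1,
) -> float:
    """Fraction of bugs with at least one buggy file in the top-k ranked files.

    Assumption: this "any hit within top-k" definition is a common convention,
    not necessarily the one used by the benchmark authors.
    """
    if not ground_truth:
        return 0.0
    hits = 0
    for bug_id, buggy_files in ground_truth.items():
        # A bug with no prediction contributes an empty ranking (a miss).
        ranked = predictions.get(bug_id, [])[:k]
        if any(path in buggy_files for path in ranked):
            hits += 1
    return hits / len(ground_truth)


# Hypothetical toy data: four bugs, each with one buggy file.
ground_truth = {
    "bug-1": {"drivers/net/ethernet/intel/e1000e/netdev.c"},
    "bug-2": {"fs/ext4/inode.c"},
    "bug-3": {"sound/pci/hda/patch_realtek.c"},
    "bug-4": {"drivers/usb/core/hub.c"},
}

# Hypothetical ranked file lists produced by an agent (best guess first).
predictions = {
    "bug-1": [
        "drivers/net/ethernet/intel/e1000e/netdev.c",
        "drivers/net/ethernet/intel/e1000e/ich8lan.c",
    ],
    "bug-2": ["fs/ext4/super.c", "fs/ext4/inode.c"],
    "bug-3": ["sound/pci/hda/hda_intel.c"],
    # "bug-4" has no prediction and therefore counts as a miss.
}

top1 = top_k_accuracy(predictions, ground_truth, k=1)
top2 = top_k_accuracy(predictions, ground_truth, k=2)
print(f"top-1: {top1:.1%}, top-2: {top2:.1%}")
```

In this toy setting only `bug-1` is hit at rank 1, while `bug-2` is recovered once the second candidate is considered, which is why top-k accuracy can only stay the same or increase as k grows.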

Paper Structure

This paper contains 35 sections, 2 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Task Distribution across Products
  • Figure 2: Performance of LLM agents on SWE-bench and LinuxFLBench.
  • Figure 3: Venn Diagram for Correctly Localized Bugs by LLM agents.
  • Figure 4: Overview of LinuxFL$^+$.
  • Figure 5: Construction pipeline of LinuxFLBench.
  • ...and 4 more figures