A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization

Sungmin Kang, Gabin An, Shin Yoo

TL;DR

AutoFL tackles the need for explainable fault localization by combining LLM-based reasoning with repository navigation via function calls to overcome context length limits. It outputs root-cause explanations and precise fault locations at the method level across Java and Python. The study shows AutoFL matching or surpassing SBFL/MBFL baselines on real-world bugs and demonstrates that developers value natural language explanations and prefer a few high-quality insights. It also introduces a confidence-aware mechanism to estimate result reliability and to filter outputs, pointing to practical adoption paths for explainable debugging.

Abstract

Fault Localization (FL), in which a developer seeks to identify which part of the code is malfunctioning and needs to be fixed, is a recurring challenge in debugging. To reduce developer burden, many automated FL techniques have been proposed. However, prior work has noted that existing techniques fail to provide rationales for the suggested locations, hindering developer adoption of these techniques. With this in mind, we propose AutoFL, a Large Language Model (LLM)-based FL technique that generates an explanation of the bug along with a suggested fault location. AutoFL prompts an LLM to use function calls to navigate a repository, so that it can effectively localize faults over a large software repository and overcome the limit of the LLM context length. Extensive experiments on 798 real-world bugs in Java and Python reveal AutoFL improves method-level acc@1 by up to 233.3% over baselines. Furthermore, developers were interviewed on their impression of AutoFL-generated explanations, showing that developers generally liked the natural language explanations of AutoFL, and that they preferred reading a few, high-quality explanations instead of many.

Paper Structure

This paper contains 26 sections, 3 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Diagram of AutoFL. Arrows represent an interaction between components; circled numbers indicate the order. Function interactions are made at most N times, where N is a predetermined parameter of AutoFL.
  • Figure 2: Scoring and ranking candidate methods (depicted as black rectangles) based on five AutoFL prediction outcomes (depicted as colored rectangles). For every prediction outcome, the scores (represented as colored circles) are evenly distributed among all the methods included in that particular outcome.
  • Figure 3: Performance of various FL techniques on the Defects4J and BugsInPy benchmarks.
  • Figure 4: Performance of AutoFL as $R$ increases, for Defects4J and BugsInPy.
  • Figure 5: Function call distribution for AutoFL-GPT3.5.
  • ...and 3 more figures
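
The scoring described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes each prediction outcome carries a total weight of 1 that is split evenly among the methods it names, and that the summed shares are divided by the number of outcomes. That normalization, the alphabetical tie-breaking, and all names (`score_methods`, `rank_methods`, the example method names) are assumptions made for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def score_methods(outcomes):
    """Aggregate several prediction outcomes into per-method scores.

    outcomes: list of collections of method names, one per prediction outcome.
    Each outcome splits a weight of 1 evenly among its methods (as in the
    Figure 2 caption); dividing by the number of outcomes is an assumption.
    """
    if not outcomes:
        return {}
    scores = defaultdict(Fraction)
    for predicted in outcomes:
        methods = set(predicted)  # ignore duplicate names within one outcome
        if not methods:
            continue  # an empty outcome contributes nothing
        share = Fraction(1, len(methods))
        for method in methods:
            scores[method] += share
    total = len(outcomes)
    return {method: s / total for method, s in scores.items()}


def rank_methods(scores):
    """Order methods by descending score; ties broken by name (an assumption)."""
    return sorted(scores, key=lambda m: (-scores[m], m))


# Five hypothetical prediction outcomes, mirroring the five in Figure 2.
outcomes = [
    ["Foo.parse"],
    ["Foo.parse", "Bar.validate"],
    ["Bar.validate"],
    ["Foo.parse"],
    ["Baz.render", "Foo.parse"],
]
scores = score_methods(outcomes)
ranking = rank_methods(scores)
print(ranking)
```

Under these assumptions, a method that appears alone in many outcomes accumulates more score than one that shares outcomes with other candidates, so the ranking favors methods the model names consistently and specifically.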