Where's the Bug? Attention Probing for Scalable Fault Localization

Adam Stein, Arthur Wayne, Aaditya Naik, Mayur Naik, Eric Wong

TL;DR

This work introduces Bug Attention Probe (BAP), a lightweight LLM probing approach for scalable fault localization that uses only coarse bug-detection supervision and no localization labels. BAP trains a single-layer transformer to convert token representations from a frozen LLM into token-level attention, which is then aggregated into line-level bug scores. With small base models and far less compute than prompting large LLMs, it achieves state-of-the-art top-1 localization accuracy across eight diverse benchmarks. It performs strongly on multi-line bugs, generalizes well to longer inputs and to new bugs and languages, and remains efficient throughout. The approach offers practical benefits for code auditing and LLM-based repair workflows by reducing reliance on expensive models and costly, fine-grained annotations.

Abstract

Ensuring code correctness remains a challenging problem even as large language models (LLMs) become increasingly capable at code-related tasks. While LLM-based program repair systems can propose bug fixes using only a user's bug report, their effectiveness is fundamentally limited by their ability to perform fault localization (FL), a challenging problem for both humans and LLMs. Existing FL approaches rely on executable test cases, require training on costly and often noisy line-level annotations, or demand resource-intensive LLMs. In this paper, we present Bug Attention Probe (BAP), a method that learns state-of-the-art fault localization without any direct localization labels, outperforming traditional FL baselines and prompting of large-scale LLMs. We evaluate our approach across a variety of code settings, including real-world Java bugs from the standard Defects4J dataset as well as seven other datasets that span a diverse set of bug types and languages. Averaged across all eight datasets, BAP improves top-1 accuracy by 34.6% over the strongest baseline and by 93.4% over zero-shot prompting of GPT-4o. BAP is also significantly more efficient than prompting, outperforming large open-weight models at a small fraction of the computational cost.

Paper Structure

This paper contains 26 sections, 3 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of our approach, Bug Attention Probe (BAP), with the baselines DeepFL and LLM prompting on a Java program snippet. The program has two bugs: the age condition on line 3 is reversed and line 6 throws a null pointer exception. BAP correctly localizes both bugs. Here, our method is trained on Llama-3.2-1B, a "small" language model (SLM), with only weak supervision, i.e., binary bug-presence labels. Obtaining comparable accuracy via prompting demands a significantly more resource-intensive LLM, such as Llama-3.2-90B, or even larger. Previous approaches to fault localization, such as DeepFL, require executable test cases before they can attempt to provide useful information.
  • Figure 2: Illustration of BAP as a method to elicit line-level fault localization from a frozen LLM through weak supervision. In step one, the probe is trained as a binary classifier to distinguish buggy from non-buggy code. Then in step two, we visualize the learned attention weights on the given sequence. Finally, in step three, we sum the attention weights within each line to produce a line-level "bugginess" score. BAP localizes the bug to the line with the highest score, the Top-1 result.
  • Figure 3: Model scale versus Top-1 on Defects4J. Each point for BAP is trained on the hidden representations from the Llama-3.2 model of the corresponding size.
  • Figure 4: Examples of bug localization with BAP on two evaluation set samples. We visualize the line-level weights from BAP above such that lines highlighted in a darker color have higher weights. BAP correctly identifies bug locations at Top-1.
  • Figure 5: Top-1 accuracy versus context length, measured by lines of code (LOC) on Defects4J. We compare BAP-Llama3.2-11B against models at least six times larger.
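
The aggregation step described in the Figure 2 caption (summing token-level attention within each line, then taking the highest-scoring line as the Top-1 prediction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `line_scores` and `top_k_lines`, the tie-breaking rule, and the example weights are all hypothetical, and the sketch assumes the attention weights and a token-to-line mapping have already been obtained from a trained probe and a tokenizer.

```python
from collections import defaultdict


def line_scores(token_weights, token_lines):
    """Sum token-level attention weights within each source line.

    token_weights: one attention weight per token (assumed to come from a trained probe).
    token_lines:   the 1-based line number each token belongs to (assumed precomputed).
    """
    if len(token_weights) != len(token_lines):
        raise ValueError("token_weights and token_lines must have the same length")
    scores = defaultdict(float)
    for weight, line in zip(token_weights, token_lines):
        scores[line] += weight
    return dict(scores)


def top_k_lines(scores, k=1):
    """Return the k lines with the highest scores.

    Ties are broken by the smaller line number (an illustrative choice).
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [line for line, _ in ranked[:k]]


# Hypothetical attention weights for six tokens spread over four lines.
weights = [0.05, 0.05, 0.10, 0.40, 0.25, 0.15]
lines = [1, 1, 2, 3, 3, 4]

scores = line_scores(weights, lines)
predicted = top_k_lines(scores, k=1)  # line 3 has the largest summed weight
```

In this toy input, line 3 accumulates the most attention and is returned as the Top-1 prediction; passing a larger `k` yields a ranked list, which is how a Top-k evaluation could consume the same scores.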