HAFix: History-Augmented Large Language Models for Bug Fixing

Yu Shi, Abdul Ali Bangash, Emad Fallahzadeh, Bram Adams, Ahmed E. Hassan

TL;DR

This paper addresses the gap that existing LLM-based bug fixing largely ignores historical repository data. It proposes HAFix, which injects seven blame-commit–derived historical heuristics into bug-fixing prompts, plus HAFix-Agg that aggregates their signals. Across Python BugsInPy and Java Defects4J single-line bugs, HAFix heuristics yield statistically significant improvements over a history-agnostic baseline, with HAFix-Agg achieving average bug-fix rate gains of about 45%–50% and large effect sizes. The study also analyzes three prompt styles, finding Instruction to be the most effective for leveraging historical context, and demonstrates substantial cost savings via early-stopping strategies, offering practical guidance for deploying history-augmented LLM bug fixing in real-world settings.

Abstract

Recent studies have explored the performance of Large Language Models (LLMs) on various Software Engineering (SE) tasks, such as code generation and bug fixing. However, these approaches typically rely on the context data from the current snapshot of the project, overlooking the potential of rich historical data residing in real-world software repositories. Additionally, the impact of prompt styles on LLM performance for SE tasks within a historical context remains underexplored. To address these gaps, we propose HAFix, which stands for History-Augmented LLMs on Bug Fixing, a novel approach that leverages seven individual historical heuristics associated with bugs and aggregates the results of these heuristics (HAFix-Agg) to enhance LLMs' bug-fixing capabilities. To empirically evaluate HAFix, we employ three Code LLMs (i.e., Code Llama, DeepSeek-Coder and DeepSeek-Coder-V2-Lite models) on 51 single-line Python bugs from BugsInPy and 116 single-line Java bugs from Defects4J. Our evaluation demonstrates that multiple HAFix heuristics achieve statistically significant improvements compared to a non-historical baseline inspired by GitHub Copilot. Furthermore, the aggregated HAFix variant HAFix-Agg achieves substantial improvements by combining the complementary strengths of individual heuristics, increasing bug-fixing rates by an average of 45.05% on BugsInPy and 49.92% on Defects4J relative to the corresponding baseline. Moreover, within the context of historical heuristics, we identify the Instruction prompt style as the most effective template compared to the InstructionLabel and InstructionMask for LLMs in bug fixing. Finally, we evaluate the cost of HAFix in terms of inference time and token usage, and provide a pragmatic trade-off analysis of the cost and bug-fixing performance, offering valuable insights for the practical deployment of our approach in real-world scenarios.
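The two quantities the abstract reports for HAFix-Agg, the aggregated set of fixed bugs and the gain relative to the baseline, can be illustrated with a minimal sketch. This excerpt does not state the exact aggregation rule, so the sketch assumes a bug counts as fixed by the aggregate when at least one heuristic fixes it. The function names, heuristic labels, bug IDs, and counts are all hypothetical and are not results from the paper.

```python
from typing import Dict, Set


def aggregate_fixed_bugs(fixed_by_heuristic: Dict[str, Set[str]]) -> Set[str]:
    """Return bug IDs fixed by at least one heuristic.

    Assumption: aggregation is a plain union over heuristics. The paper's
    actual HAFix-Agg rule may differ.
    """
    fixed: Set[str] = set()
    for bugs in fixed_by_heuristic.values():
        fixed |= bugs
    return fixed


def relative_gain_percent(baseline_fixed: int, augmented_fixed: int) -> float:
    """Relative improvement over the baseline, in percent."""
    if baseline_fixed <= 0:
        raise ValueError("baseline_fixed must be positive")
    return 100.0 * (augmented_fixed - baseline_fixed) / baseline_fixed


# Made-up example: three hypothetical heuristics with partly overlapping fixes.
example = {
    "heuristic_a": {"bug-1", "bug-2", "bug-3"},
    "heuristic_b": {"bug-2", "bug-4"},
    "heuristic_c": {"bug-5", "bug-6"},
}
aggregated = aggregate_fixed_bugs(example)

# Suppose a hypothetical baseline fixes 4 bugs while the aggregate fixes 6.
gain = relative_gain_percent(baseline_fixed=4, augmented_fixed=len(aggregated))
```

The union shows why complementary heuristics help: no single hypothetical heuristic above fixes more than three bugs, yet together they cover six.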

Paper Structure

This paper contains 51 sections, 23 figures, 12 tables.

Figures (23)

  • Figure 1: Dataset collection for HAFix: ① represents the data used for the baseline, while ② to ⑧ represent the data for various historical heuristics. V4 refers to the snapshot of the project version where the bug fix was committed, and V3 is the snapshot of the previous version containing the bug. V2 is the snapshot of the last commit modifying the buggy line in the V4 snapshot, while V1 is the snapshot of the commit preceding V2. The rationale for selecting the blame commit and these historical heuristics is detailed in the section "HAFix: History-Augmented LLMs for Bug Fixing".
  • Figure 2: An example of the bug description we mined from the GitHub issue page.
  • Figure 3: Example of three prompt styles.
  • Figure 4: HAFix architecture and evaluation pipeline.
  • Figure 5: Pass@k (%) comparison of baseline and seven HAFix heuristics for bug-fixing performance across two datasets and three models.
  • ...and 18 more figures
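
The Pass@k metric named in the Figure 5 caption is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a bug and c is the number that pass the tests. This excerpt does not say which estimator the paper uses, so the sketch below is an assumption based on that common definition. The function name and the sample counts are illustrative only.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one bug.

    n: total generated samples, c: samples that pass, k: sample budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over bugs, given (n, c) pairs per bug."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must not be empty")
    return sum(scores) / len(scores)


# Illustrative counts only: 10 samples per bug with 3, 0, and 10 passing.
per_bug = [(10, 3), (10, 0), (10, 10)]
score_k1 = mean_pass_at_k(per_bug, k=1)
score_k5 = mean_pass_at_k(per_bug, k=5)
```

Multiplying the result by 100 gives the percentage form used in the caption.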