Aligning the Objective of LLM-based Program Repair

Junjielong Xu, Ying Fu, Shin Hwei Tan, Pinjia He

TL;DR

This work reframes automated program repair (APR) for decoder-only LLMs by aligning the model's output with its pre-training objective and abandoning the traditional localize-then-repair workflow. The authors introduce D4C, a prompting framework that completes entire functions, using artifacts (descriptions, failed tests, and error messages) to guide the repair without requiring statement-level fault localization. Empirical results on Defects4J and DebugBench show D4C achieving state-of-the-art repair rates (roughly 10% higher than baselines given perfect fault localization) while sampling each patch only 10 times, a 90% reduction in patch samples. The findings advocate a new mindset for using LLMs in APR, one that emphasizes objective alignment and whole-function refinement.

Abstract

Large language models (LLMs) have achieved decent results on automated program repair (APR). However, the next-token prediction training objective of decoder-only LLMs (e.g., GPT-4) is misaligned with the masked span prediction objective of current infilling-style methods, which impedes LLMs from fully leveraging pre-trained knowledge for program repair. In addition, while some LLMs can locate and repair bugs in certain functions using the related artifacts (e.g., test cases), existing methods still depend on statement-level fault localization methods to provide a list of buggy hunks for repair. This restriction hinders LLMs from exploring potential patches beyond the given locations. In this paper, we investigate a new approach to adapt LLMs to program repair. Our core insight is that LLMs' APR capability can be greatly improved by simply aligning the output to their training objective and allowing them to refine the whole program without first identifying faulty statements. Based on this insight, we designed D4C, a straightforward prompting framework for APR. D4C can repair 180 bugs correctly in Defects4J, with each patch being sampled only 10 times. This surpasses the SOTA APR methods with perfect fault localization by 10% and reduces the patch sampling number by 90%. Our findings reveal that (1) objective alignment is crucial for fully exploiting LLMs' pre-trained capability, and (2) replacing the traditional localize-buggy-hunks-then-repair workflow with direct debugging is more effective for LLM-based APR methods. Thus, we believe this paper introduces a new mindset for harnessing LLMs in APR.
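
To make the misalignment described in the abstract concrete, the following is a minimal sketch of how the two input formulations differ. It is an illustration under stated assumptions, not the paper's implementation: the mask token, the comment labels, and both function names are hypothetical.

```python
from typing import List

# Hypothetical mask token; real infilling models define their own sentinel.
MASK = "<INFILL>"


def infilling_input(lines: List[str], start: int, end: int) -> str:
    """Infilling style: replace the located hunk lines[start:end] with a mask.

    The model predicts only the masked span, so a fix outside the
    given location is out of reach.
    """
    return "\n".join(lines[:start] + [MASK] + lines[end:])


def completion_input(lines: List[str]) -> str:
    """Completion style: show the whole buggy function and leave the prompt
    open, so the model continues it with a complete refined function
    (plain next-token prediction, no hunk location needed)."""
    buggy = "\n".join(lines)
    # The two comment labels are illustrative, not the paper's wording.
    return f"// Buggy function\n{buggy}\n// Fixed function\n"
```

In the first formulation the output is a span that must fit a hole chosen by a fault localization tool; in the second the output is an ordinary continuation of the text, which is the form a decoder-only model was trained to produce.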

Paper Structure

This paper contains 29 sections, 5 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: First row: LLM structures and their training objectives. Second row: The training and inference objectives are misaligned when using decoder-only LLMs for infilling-style APR. Third row: An intuitive way to close the gap: using LLMs for entire program completion rather than masked span prediction.
  • Figure 2: An example of automated program repair workflow.
  • Figure 3: An example of two different APR paradigms. First row: Locating and repairing buggy hunks subsequently may cause many invalid attempts at patch generation. Second row: Locating and repairing buggy hunks simultaneously can mitigate the cost of patching at specific hunks. (The wavy line marks a buggy hunk.)
  • Figure 4: The workflow of D4C. It uses the buggy code and its corresponding documents, failed tests, and test info (e.g., error message) to construct the prompt for one-shot prompting-based program repair without a specific buggy hunk (usually provided by statement-level FL tools).
  • Figure 5: The prompt structure of D4C. The details of the code are omitted. The example pair is fixed and is used to constrain the LLM's response format.
  • ...and 3 more figures
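
The captions of Figures 4 and 5 describe a one-shot prompt assembled from the buggy code, its documentation, a failed test, and the test's error message, with a fixed example pair setting the response format. The sketch below shows one way such a prompt could be put together. The class, field names, section labels, and function names are all assumptions made for illustration; the paper's exact prompt text is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class RepairArtifacts:
    """Inputs named in the Figure 4 caption; field names are illustrative."""
    buggy_function: str
    documentation: str
    failed_test: str
    error_message: str


def render_case(a: RepairArtifacts) -> str:
    # Section labels are assumptions, not the paper's exact prompt wording.
    return (
        f"### Documentation\n{a.documentation}\n"
        f"### Failed test\n{a.failed_test}\n"
        f"### Error message\n{a.error_message}\n"
        f"### Buggy function\n{a.buggy_function}\n"
        f"### Fixed function\n"
    )


def build_prompt(example: RepairArtifacts, example_fix: str,
                 target: RepairArtifacts) -> str:
    """One-shot prompt: a fixed example pair, then the target case left open.

    No buggy-hunk location is included; the model is expected to emit the
    whole refined function as a continuation of the final section.
    """
    return render_case(example) + example_fix.rstrip() + "\n\n" + render_case(target)
```

Because the prompt ends on an open "fixed function" section, each sampled completion is a full candidate function. Per the abstract, D4C samples each patch only 10 times; in a pipeline like this one, each candidate would then be checked against the failed tests.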