When Large Language Models Confront Repository-Level Automatic Program Repair: How Well They Done?

Yuxiao Chen; Jingzheng Wu; Xiang Ling; Changjiang Li; Zhiqing Rui; Tianyue Luo; Yanjun Wu

TL;DR

The paper studies repository-level automatic program repair and shows that function-level context is insufficient for many real-world bugs. It introduces RepoBugs, a benchmark of 124 repository-level interface inconsistencies, and RLCE, a context extraction and prompt-generation framework that uses repository structure and error localization to build precise prompts for LLMs. Across GPT-3.5, PaLM2, and GPT-4, RLCE yields substantial relative gains in repair rate over function-level baselines (often exceeding 100%, and up to 160% for GPT-3.5). The paper also analyzes how context sources and prompting strategies affect outcomes, offers practical guidance for repository-level APR, and supplies a replicable dataset and methodology for future research and tool development.

Abstract

In recent years, large language models (LLMs) have demonstrated substantial potential in addressing automatic program repair (APR) tasks. However, the current evaluation of these models for APR tasks focuses solely on the limited context of the single function or file where the bug is located, overlooking the valuable information in the repository-level context. This paper investigates the performance of popular LLMs in handling repository-level repair tasks. We introduce RepoBugs, a new benchmark comprising 124 typical repository-level bugs from open-source repositories. Preliminary experiments with GPT-3.5, given only the function where the error is located, reveal a repair rate on RepoBugs of only 22.58%, well below the performance of GPT-3.5 on function-level bugs reported in related studies. This underscores the importance of providing repository-level context when addressing bugs at this level. However, the repository-level context offered by the preliminary method often proves redundant and imprecise, and easily exceeds the prompt length limit of LLMs. To solve this problem, we propose a simple and universal repository-level context extraction method (RLCE) designed to provide more precise context for repository-level code repair tasks. Evaluations of three mainstream LLMs show that RLCE significantly enhances the ability to repair repository-level bugs. The improvement reaches a maximum of 160% compared to the preliminary method. Additionally, we conduct a comprehensive analysis of the effectiveness and limitations of RLCE, along with the capacity of LLMs to address repository-level bugs, offering valuable insights for future research.
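To make the idea of repository-level context concrete, the following is a minimal sketch, not the authors' RLCE implementation. It assumes a hypothetical two-file repository held in memory, in which a caller passes too few arguments to a `power` function defined in `utils.py`. The helper names (`find_definitions`, `called_names`, `build_prompt`), the function bodies, and the prompt wording are all illustrative assumptions. The sketch collects the definitions of functions called by the buggy function and appends them to the prompt, so the mismatch with the callee's signature becomes visible to the model.

```python
import ast

# Hypothetical repository contents; the file bodies are invented for illustration.
REPO = {
    "utils.py": "def power(base, exponent):\n    return base ** exponent\n",
    "main.py": "from utils import power\n\ndef square(x):\n    return power(x)\n",
}

def find_definitions(repo):
    """Map each top-level function name to (file path, definition source).

    Name collisions across files are ignored in this sketch.
    """
    defs = {}
    for path, source in repo.items():
        for node in ast.parse(source).body:
            if isinstance(node, ast.FunctionDef):
                defs[node.name] = (path, ast.get_source_segment(source, node))
    return defs

def called_names(function_source):
    """Return the sorted names of functions called by simple name in the code."""
    return sorted({
        node.func.id
        for node in ast.walk(ast.parse(function_source))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    })

def build_prompt(repo, buggy_function):
    """Compose a repair prompt: the buggy function plus definitions it calls."""
    defs = find_definitions(repo)
    buggy_path, buggy_source = defs[buggy_function]
    parts = [
        "Fix the bug in the following function.",
        f"# {buggy_path}\n{buggy_source}",
    ]
    for name in called_names(buggy_source):
        if name in defs and name != buggy_function:
            path, source = defs[name]
            parts.append(f"# Related definition from {path}\n{source}")
    return "\n\n".join(parts)

prompt = build_prompt(REPO, "square")
print(prompt)
```

With only `square` in the prompt, nothing indicates that `power` expects two arguments; including the definition from `utils.py` supplies exactly the information needed to notice the inconsistency. Retrieving only the called definitions, rather than whole files, is one simple way to keep the prompt short.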

Paper Structure (28 sections, 9 figures, 3 tables)

Introduction
Benchmark Construction
Dataset Collection
Dataset Generation
Proposed Framework
Overall Framework
Context Retriever
Prompt Composer
Evaluation Setup
Models
Prompt Generation
Compared Repair Baselines
Evaluation Metrics
Results and Analysis
RLCE Outperforms the Baselines
...and 13 more sections

Figures (9)

Figure 1: A simple example of using ChatGPT to handle repository-level bugs. The bug type is interface inconsistency: the number of arguments in the call to the power function is inconsistent with the function definition in utils.py. The left panel shows ChatGPT's reply when only the function where the bug is located is provided as context. The right panel shows the reply after the repository context is added.
Figure 2: The overview of our repository-level context extraction method (RLCE).
Figure 3: Statistical categorization of repair outcomes using the Detail and CoT methods for three models, along with CoT-generated explanations. The horizontal axis comprises four categories, each composed of two uppercase letters, T or F, representing whether the repair results using the Detail and CoT methods are correct (e.g., FT denotes all cases where Detail repair is incorrect and CoT repair is correct). The vertical axis represents the number of cases for each category. In each category, red indicates instances where the CoT method provided incorrect explanations, while green signifies correct explanations.
Figure 4: Repair accuracy of the prompt strategies used with GPT-3.5 and GPT-4, broken down by error category. The horizontal axis denotes the six categories of errors, with specific definitions provided in Section 2.2. The total number of samples corresponding to each category is enclosed in parentheses. Different colors of bars represent distinct methods, while the vertical axis represents the repair accuracy.
Figure 5: The relationship between prompt length and repair accuracy, depicted through six subfigures. Each subfigure represents a distinct prompt strategy. All cases within each prompt strategy are evenly divided into four subsets based on prompt length (with a total dataset size of 124, resulting in 31 cases per subset). In each subfigure, bars represent the average prompt length in tokens, while the line graph illustrates the repair accuracy.
...and 4 more figures
