Exploring Generalizable Automated Program Repair with Large Language Models

Viola Campos, Ridwan Shariffdeen, Adrian Ulges, Yannic Noller

TL;DR

This work presents an intensive empirical evaluation of LLMs' capabilities in APR. It explores language-agnostic repair using benchmarks for Java, JavaScript, Python, and PHP, and investigates the effects of fault localization.

Abstract

Automated Program Repair (APR) proposes bug fixes to aid developers in maintaining software. The state of the art in this domain focuses on LLMs, leveraging their strong capabilities to comprehend specifications in natural language and to generate program code. However, despite the APR community's research achievements and industry deployments, APR still cannot generalize broadly. In this work, we present an intensive empirical evaluation of LLMs' capabilities in APR. We evaluate a diverse set of 13 recent open and closed models. In particular, we explore language-agnostic repair by utilizing benchmarks for Java, JavaScript, Python, and PHP. Besides the generalization across languages and levels of patch complexity, we also investigate the effects of fault localization (FL). Our key results include: (1) Different LLMs tend to perform best for different languages, which makes it hard to develop cross-platform, single-LLM repair techniques. (2) Combining models by pooling repairs adds value with respect to uniquely fixed bugs, so a committee of expert models should be considered. (3) Under realistic assumptions of imperfect FL, we observe significant drops in accuracy from the usual practice of using perfect FL. Our insights will help develop reliable and generalizable APR techniques and evaluate them in realistic and fair environments.

Paper Structure

This paper contains 35 sections, 2 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Frequency of indentation errors for the Python benchmark. Notably, most models struggle with indentation issues, with the Gemini models being the exception.
  • Figure 2: Bug fixes per model, determined for 400 single-function bugs in four programming languages.
  • Figure 3: Combined pass@k (using the test prompt) for ensembles of the best-performing model per benchmark with a second model.
  • Figure 4: The accuracy of program repair (pass@5, base prompt, macro-averaged over all 4 benchmarks), plotted against models' release dates. For open models (blue), circle size indicates the respective model's size. Open models are catching up to closed models (red), and have even surpassed them with DeepSeek R1 (dist.)'s release.
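
The captions for Figures 3 and 4 refer to pass@k, macro-averaging over benchmarks, and combining models. The sketch below illustrates one common way to compute these quantities. It assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled patches of which c pass the tests; this summary does not state which estimator the paper uses. The function names and the example bug identifiers are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one bug (assumed estimator, see lead-in).

    n: number of patches sampled, c: number that pass the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing patch.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def macro_average(scores_per_benchmark: dict[str, list[float]]) -> float:
    """Average the per-benchmark means so each benchmark weighs equally."""
    means = [sum(s) / len(s) for s in scores_per_benchmark.values()]
    return sum(means) / len(means)


def pooled_fixes(fixed_by_model: dict[str, set[str]]) -> set[str]:
    """Bugs fixed by at least one model when repairs are pooled."""
    return set().union(*fixed_by_model.values())


def unique_fixes(fixed_by_model: dict[str, set[str]], model: str) -> set[str]:
    """Bugs fixed by `model` and by no other model in the pool."""
    others = [bugs for name, bugs in fixed_by_model.items() if name != model]
    return fixed_by_model[model] - set().union(*others)


# Hypothetical data: bug identifiers fixed by two placeholder models.
example = {"model_a": {"bug1", "bug2"}, "model_b": {"bug2", "bug3"}}
pooled = pooled_fixes(example)
only_a = unique_fixes(example, "model_a")
```

In this toy example the pool covers three bugs while each model alone covers two, which is the kind of gain from uniquely fixed bugs that the abstract attributes to combining models.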