Evaluating the Generalizability of LLMs in Automated Program Repair

Fengjie Li; Jiajun Jiang; Jiajun Sun; Hongyu Zhang

TL;DR

This work investigates the generalizability of 11 top-performing LLMs in automated program repair (APR) by evaluating them on Defects4J and on Defects4J-Trans, a semantically equivalent transformation of that dataset. The authors find substantial generalization gaps: on the transformed data, the average number of correct patches declines by 49.48% and the average number of plausible patches by 42.90%, which highlights the limits of current LLM-based APR. They further test three prompts that add repair-relevant information and observe improvements of up to 136.67% for correct patches and 121.82% for plausible patches, yet most models still underperform relative to their Defects4J results, indicating that prompt engineering alone is insufficient. The study suggests directions for future work, such as code normalization and integrating traditional APR techniques with LLMs, and open-sources all data to facilitate replication and broader evaluation.

Abstract

LLM-based automated program repair (APR) methods have attracted significant attention for their state-of-the-art performance. However, they have been evaluated primarily on a few well-known datasets such as Defects4J, raising questions about their effectiveness on new datasets. In this study, we evaluate 11 top-performing LLMs on Defects4J-Trans, a new dataset derived by transforming Defects4J while preserving the original semantics. Experiments on both Defects4J and Defects4J-Trans show that all studied LLMs have limited generalizability in APR tasks: the average number of correct and plausible patches decreases by 49.48% and 42.90%, respectively, on Defects4J-Trans. Further investigation shows that incorporating additional repair-relevant information into repair prompts significantly enhances the LLMs' capabilities (increasing the number of correct and plausible patches by up to 136.67% and 121.82%, respectively), yet performance still falls short of the original Defects4J results. This indicates that prompt engineering alone is insufficient to substantially enhance LLMs' repair capabilities. Based on our study, we also offer several recommendations for future research.
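
To make the idea of a semantics-preserving transformation concrete, the following is a minimal sketch and not the transformation procedure used to build Defects4J-Trans. The specific rewrites shown (identifier renaming, a `for` loop turned into a `while` loop, an inverted condition) and all function names are assumptions chosen for illustration, and the example is written in Python for brevity even though Defects4J consists of Java programs.

```python
def sum_positive_original(values):
    """Original form: for-loop with an early `continue`."""
    total = 0
    for v in values:
        if v <= 0:
            continue
        total += v
    return total


def sum_positive_transformed(items):
    """Hypothetical transformed form: renamed identifiers, while-loop, inverted condition."""
    acc = 0
    idx = 0
    while idx < len(items):
        if items[idx] > 0:
            acc += items[idx]
        idx += 1
    return acc


def behaves_identically(f, g, inputs):
    """Spot-check that two functions return the same result on every sample input.

    Agreement on samples is evidence of equivalence, not a proof.
    """
    return all(f(list(x)) == g(list(x)) for x in inputs)


# Hypothetical sample inputs, standing in for a project's test suite.
samples = [[], [1, 2, 3], [-1, 0, 4], [0, 0], [5, -5, 5]]
print(behaves_identically(sum_positive_original, sum_positive_transformed, samples))
```

The two functions differ in surface form but compute the same result, so a model that repairs a bug in one form but not in the other is reacting to syntax rather than to program behavior, which is the kind of gap the study measures.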
