Automated Program Repair: Emerging trends pose and expose problems for benchmarks
Joseph Renzullo, Pemma Reiter, Westley Weimer, Stephanie Forrest
TL;DR
This paper examines how machine learning, especially large language models, has transformed Automated Program Repair (APR) and how benchmarks have struggled to remain valid in this ML-centric era. It documents data leakage, contamination, and memorization as core threats to generalization, and shows that APR benchmarks differ fundamentally from typical ML datasets in scale, reproducibility, and evaluation protocols. The authors propose standardization efforts, including StandUp4NPR for evaluation and APR-COMP 2024 to unify benchmarks and track progress. Together, these insights underscore the need for rigorous, comparable, and reproducible evaluations to ensure ML-based APR methods translate into reliable, real-world software repairs.
Abstract
Machine learning (ML) now pervades the field of Automated Program Repair (APR). Algorithms deploy neural machine translation and large language models (LLMs) to generate software patches, among other tasks. But there are important differences between these applications of ML and earlier work. Evaluations and comparisons must take care to ensure that results are valid and likely to generalize. A challenge is that the most popular APR evaluation benchmarks were not designed with ML techniques in mind. This is especially true for LLMs, whose large and often poorly disclosed training datasets may include the very problems on which the models are evaluated.
