Is the Cure Still Worse Than the Disease? Test Overfitting by LLMs in Automated Program Repair
Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, Martin Hirzel
TL;DR
This work investigates whether test overfitting, in which repaired code passes the white-box tests seen during repair but fails hidden black-box tests, persists in modern LLM-driven automated program repair (APR) workflows. It uses repository-level SWE-bench tasks, with Agentless generating code patches and e-Otter++ generating reproduction tests, then applies a test-based refinement loop guided by an LLM critic to study overfitting under various settings. The findings show that overfitting remains a risk (roughly 22–34% in the initial settings, rising with refinement) and that exposing golden black-box tests yields only modest gains in issue resolution while still elevating overfitting risk. These results call for careful design of evaluation and mitigation strategies in APR pipelines. The work highlights the tradeoffs of using tests to guide patch generation and motivates future methods to detect and reduce overfitting in LLM-assisted software repair, with potential impact on the reliability and safety of automated repair systems.
Abstract
Automated program repair has been shown to be susceptible to generating repaired code that passes on seen tests but fails on a hold-out set of hidden tests. This problem, dubbed test overfitting, was identified and studied before the rise of large language models. We experimentally study how much test overfitting is still a problem today, using repository-level SWE-bench tasks.
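To make the definition of test overfitting concrete, the following is a minimal sketch of how an overfitting rate could be computed from per-patch test outcomes. The `PatchResult` record, its field names, and the sample data are hypothetical and not taken from the paper. The choice of denominator (patches that pass the seen tests) is also an assumption, and the paper's exact metric may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PatchResult:
    """Outcome of one candidate patch (hypothetical record, not from the paper)."""
    task_id: str
    passes_seen_tests: bool    # tests visible to the repair workflow
    passes_hidden_tests: bool  # held-out tests used only for evaluation


def is_overfit(result: PatchResult) -> bool:
    """A patch overfits if it passes the seen tests but fails the hidden ones."""
    return result.passes_seen_tests and not result.passes_hidden_tests


def overfitting_rate(results: Iterable[PatchResult]) -> float:
    """Fraction of seen-test-passing patches that fail the hidden tests.

    Assumption: the denominator is the set of patches that pass the seen
    tests; returns 0.0 when no patch passes them.
    """
    plausible = [r for r in results if r.passes_seen_tests]
    if not plausible:
        return 0.0
    return sum(is_overfit(r) for r in plausible) / len(plausible)


# Illustrative data only; these are not results from the study.
sample = [
    PatchResult("task-a", passes_seen_tests=True, passes_hidden_tests=True),
    PatchResult("task-b", passes_seen_tests=True, passes_hidden_tests=False),
    PatchResult("task-c", passes_seen_tests=False, passes_hidden_tests=False),
    PatchResult("task-d", passes_seen_tests=True, passes_hidden_tests=True),
]
rate = overfitting_rate(sample)
```

Here three of the four sample patches pass the seen tests and one of those fails the hidden tests, so `rate` is one third. Patches that fail the seen tests are excluded because they would not be accepted as repairs in the first place.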
