Exploring the Potential of Conversational Test Suite Based Program Repair on SWE-bench
Anton Cheshkov, Pavel Zadorozhny, Rodion Levichev, Evgeny Maslov, Ronaldo Franco Jaldin
TL;DR
This work evaluates the potential of conversational patch generation (CPG) for SWE-Bench AIR tasks by leveraging known fault localization and failing-test feedback. Using 92 SWE-Bench Lite problems, the study compares two LLMs (Llama3.1 70B Instruct and GPT-4o-mini) across two experiment modes: 6-round conversations with failure feedback and 30 one-shot attempts without failure feedback. Results show that failure-aware CPG can produce patches passing the public test suite $T$ in up to 62% (Llama3.1) and 56% (GPT-4o-mini), with 47% and 46% also passing the hidden set $T^*$. In the one-shot setup, passing rates are lower overall, but CPG still outperforms repetitive sampling for Llama3.1, indicating meaningful potential for project-level automatic repair and guiding future improvements in fault localization and patch-generation pipelines.
Abstract
Automatic program repair at project level may open yet to be seen opportunities in various fields of human activity. Since the SWE-Bench challenge was presented, we have seen numerous of solutions. Patch generation is a part of program repair, and test suite-based conversational patch generation has proven its effectiveness. However, the potential of conversational patch generation has not yet specifically estimated on SWE-Bench. This study reports experimental results aimed at evaluating the individual effectiveness of conversational patch generation on problems from SWE-Bench. The experiments show that a simple conversational pipeline based on LLaMA 3.1 70B can generate valid patches in 47\% of cases, which is comparable to the state-of-the-art in program repair on SWE-Bench.
