Exploring the Potential of Conversational Test Suite Based Program Repair on SWE-bench

Anton Cheshkov; Pavel Zadorozhny; Rodion Levichev; Evgeny Maslov; Ronaldo Franco Jaldin

TL;DR

This work evaluates the potential of conversational patch generation (CPG) for SWE-Bench AIR tasks by leveraging known fault localization and failing-test feedback. Using 92 SWE-Bench Lite problems, the study compares two LLMs (Llama3.1 70B Instruct and GPT-4o-mini) across two experiment modes: 6-round conversations with failure feedback and 30 one-shot attempts without failure feedback. Results show that failure-aware CPG can produce patches passing the public test suite $T$ in up to 62% (Llama3.1) and 56% (GPT-4o-mini) of cases, with 47% and 46%, respectively, also passing the hidden set $T^*$. In the one-shot setup, passing rates are lower overall, but CPG still outperforms repetitive sampling for Llama3.1, indicating meaningful potential for project-level automatic repair and guiding future improvements in fault localization and patch-generation pipelines.

Abstract

Automatic program repair at the project level may open yet-to-be-seen opportunities in various fields of human activity. Since the SWE-Bench challenge was presented, numerous solutions have been proposed. Patch generation is a part of program repair, and test suite-based conversational patch generation has proven its effectiveness. However, the potential of conversational patch generation has not yet been specifically estimated on SWE-Bench. This study reports experimental results aimed at evaluating the individual effectiveness of conversational patch generation on problems from SWE-Bench. The experiments show that a simple conversational pipeline based on Llama3.1 70B can generate valid patches in 47% of cases, which is comparable to the state of the art in program repair on SWE-Bench.

Paper Structure (10 sections, 2 figures, 1 algorithm)

Introduction
Methodology
Results & Discussion
Related Work
Conclusions
List of SWE-Bench problems in the experiment
Prompt Templates
Prompt template A
Prompt template B
Prompt C

Figures (2)

Figure 1: Cumulative percentage of valid patches generated during 6 consecutive independent conversations of 5 LLM requests each. Two LLMs: Llama3.1 70B Instruct and GPT-4o-mini. Two patch validation sets: $T$ and $T \cup T^*$.
Figure 2: Cumulative percentage of valid patches generated during 30 repeated independent patch generations. Two LLMs: Llama3.1 70B Instruct and GPT-4o-mini. Two patch validation sets: $T$ and $T \cup T^*$.
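
As an illustration of how a cumulative curve like those in the figures could be computed, the following is a minimal sketch, not the authors' code. It assumes a hypothetical input `outcomes`, where `outcomes[p][k]` records whether attempt `k` (a conversation or a one-shot generation) on problem `p` produced a patch that passed the chosen validation set. The function name and the data layout are assumptions made for this example.

```python
from typing import List, Sequence


def cumulative_solved_percent(outcomes: Sequence[Sequence[bool]]) -> List[float]:
    """Percentage of problems with at least one valid patch after each attempt.

    outcomes[p][k] is True if attempt k on problem p yielded a valid patch
    (hypothetical layout, assumed for illustration only).
    """
    if not outcomes:
        return []
    n_attempts = max(len(row) for row in outcomes)
    curve = []
    for k in range(n_attempts):
        # A problem counts as solved once any attempt up to index k succeeded.
        solved = sum(1 for row in outcomes if any(row[: k + 1]))
        curve.append(100.0 * solved / len(outcomes))
    return curve


# Toy data: 4 problems, 3 attempts each (not taken from the paper).
example = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]
print(cumulative_solved_percent(example))
```

Because a problem stays solved once any earlier attempt succeeds, the resulting curve is non-decreasing, which matches the cumulative reading of the figure captions.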
