Heterogeneous Prompting and Execution Feedback for SWE Issue Test Generation and Selection

Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, Martin Hirzel

TL;DR

This work tackles generating reproduction tests from SWE issues before a code patch that resolves the issue is available. It introduces e-Otter++, which augments Otter++ with execution feedback, issue description morphs, and test-generation context masks, and uses surrogate code patches to guide test selection. Empirical results on SWT-bench Lite, TDD-Bench Verified, and SWE-rebench show substantial improvements, including a new state-of-the-art fail-to-pass ($F\!\!\to\!\!P$) rate of 63.0% on TDD-Bench Verified, and indicate that heterogeneous prompting and execution-informed repair are the key drivers. The approach also improves SWE-patch validation and could be integrated with real-world SWE agents, advancing TDD-driven quality assurance in software engineering.

Abstract

A software engineering issue (SWE issue) is easier to resolve when accompanied by a reproduction test. Unfortunately, most issues do not come with functioning reproduction tests, so this paper explores how to generate them automatically. The primary challenge in this setting is that the code to be tested is either missing or wrong, as evidenced by the existence of the issue in the first place. This has held back test generation for this setting: without the correct code to execute, it is difficult to leverage execution feedback to generate good tests. This paper introduces novel techniques for leveraging execution feedback to get around this problem, implemented in a new reproduction test generator called e-Otter++. Experiments show that e-Otter++ represents a leap ahead in the state-of-the-art for this problem, generating tests with an average fail-to-pass rate of 63% on the TDD-Bench Verified benchmark.

Paper Structure

This paper contains 28 sections, 7 figures, 15 tables, 1 algorithm.

Figures (7)

  • Figure 1: Evaluation harness for bug reproduction test. (1) The issue description and $c_\textrm{old}$ go into e-Otter++ as input, (2) e-Otter++ generates a test patch with a reproduction test $y$, (3) executing the test $y$ on $c_\textrm{old}$ should fail in order to reproduce the issue, (4) a developer-written golden code patch is applied to $c_\textrm{old}$ as a pull request resulting in $c_\textrm{new}$, (6) executing the test $y$ on $c_\textrm{new}$ should pass now in order to confirm that the issue has been addressed by the pull request.
  • Figure 2: Overview of our approach.
  • Figure 3: Issue description and test patch for sympy__sympy-23413. Some lines are omitted from the patch due to space constraints.
  • Figure 4: Number of attempts taken in test repair.
  • Figure 5: $F\!\!\to\!\!P$ @ N vs. N: a comparison of tests generated with higher temperature and heterogeneous prompting.
  • ...and 2 more figures
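
The fail-to-pass criterion described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration, not the paper's harness: the `RunOutcome` record and both function names are hypothetical, and the rate is assumed here to be the plain fraction of generated tests that fail on $c_\textrm{old}$ and pass on $c_\textrm{new}$. The benchmarks may define the metric with additional conditions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RunOutcome:
    """Hypothetical record: result of one generated test y on both code versions."""
    passed_on_old: bool  # did y pass on c_old (before the golden code patch)?
    passed_on_new: bool  # did y pass on c_new (after the golden code patch)?


def is_fail_to_pass(outcome: RunOutcome) -> bool:
    # y reproduces the issue only if it fails on c_old AND passes on c_new.
    return (not outcome.passed_on_old) and outcome.passed_on_new


def fail_to_pass_rate(outcomes: Iterable[RunOutcome]) -> float:
    # Assumed definition: fraction of tests that are fail-to-pass.
    results = list(outcomes)
    if not results:
        return 0.0
    return sum(is_fail_to_pass(o) for o in results) / len(results)


if __name__ == "__main__":
    runs = [
        RunOutcome(passed_on_old=False, passed_on_new=True),   # fail-to-pass
        RunOutcome(passed_on_old=True, passed_on_new=True),    # does not reproduce
        RunOutcome(passed_on_old=False, passed_on_new=False),  # still fails
    ]
    print(f"fail-to-pass rate: {fail_to_pass_rate(runs):.2f}")
```

A test that passes on both versions does not reproduce the issue, and a test that fails on both does not confirm the fix, so only the fail-then-pass combination counts.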