Issue2Test: Generating Reproducing Test Cases from Issue Reports

Noor Nashid; Islem Bouzenia; Michael Pradel; Ali Mesbah

TL;DR

Issue2Test introduces a three-phase LLM-driven workflow that automatically generates reproducing tests, i.e., tests that fail for the reason described in a GitHub issue. It combines issue comprehension with project-specific meta-prompts, fail-first test generation, and an execution-driven refinement loop that steers each candidate test toward a failure matching the reported problem. On the SWT-bench-lite benchmark, Issue2Test reproduces 84 of 276 issues (≈30.4% fail-to-pass, F→P) and uniquely solves 20 issues not addressed by baselines, at a cost of roughly $0.0521–$0.66 per issue depending on the model. These results indicate significant potential for automating issue solving and patch validation, and the approach lays groundwork for deeper integration with automated repair pipelines and project-specific testing practices.

Abstract

Automated tools for solving GitHub issues are receiving significant attention from both researchers and practitioners, e.g., in the form of foundation models and LLM-based agents prompted with issues. A crucial step toward successfully solving an issue is creating a test case that accurately reproduces the issue. Such a test case can guide the search for an appropriate patch and help validate whether the patch matches the issue's intent. However, existing techniques for issue reproduction show only moderate success. This paper presents Issue2Test, an LLM-based technique for automatically generating a reproducing test case for a given issue report. Unlike automated regression test generators, which aim to create passing tests, our approach aims at a test that fails, and that fails specifically for the reason described in the issue. To this end, Issue2Test performs three steps: (1) understand the issue and gather context (e.g., related files and project-specific guidelines) relevant to reproducing it; (2) generate a candidate test case; and (3) iteratively refine the test case based on compilation and runtime feedback until it fails and the failure aligns with the problem described in the issue. We evaluate Issue2Test on the SWT-bench-lite dataset, where it successfully reproduces 32.9% of the issues, achieving a 16.3% relative improvement over the best existing technique. Our evaluation also shows that Issue2Test reproduces 20 issues that four prior techniques fail to address, contributing a total of 60.4% of all issues reproduced by these tools. We expect our approach to advance the important task of automatically solving GitHub issues.
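The generate-then-refine steps described in the abstract can be pictured as a small control loop. The following Python sketch is illustrative only and is not the authors' implementation: the names `RunResult`, `reproduce_issue`, `generate`, `run`, `matches_issue`, `refine`, and the `max_rounds` budget are all hypothetical, and the context-gathering step is assumed to happen inside `generate`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    """Outcome of executing one candidate test (hypothetical structure)."""
    passed: bool  # True if the test passed on the unpatched code
    output: str   # compilation/runtime output, used as refinement feedback


def reproduce_issue(
    issue_text: str,
    generate: Callable[[str], str],
    run: Callable[[str], RunResult],
    matches_issue: Callable[[str, RunResult], bool],
    refine: Callable[[str, str, RunResult], str],
    max_rounds: int = 5,  # assumed iteration budget, not taken from the paper
) -> Optional[str]:
    """Return a test that fails for the issue's reason, or None if none is found."""
    # Step 2: produce a first candidate test from the issue (and its context).
    candidate = generate(issue_text)
    for _ in range(max_rounds):
        # Step 3: execute the candidate and inspect the feedback.
        result = run(candidate)
        # Accept only a test that fails AND whose failure matches the issue.
        if not result.passed and matches_issue(issue_text, result):
            return candidate
        # Otherwise the test passed or failed for an unrelated reason
        # (e.g., a broken import), so revise it using the feedback.
        candidate = refine(issue_text, candidate, result)
    return None
```

In this sketch the acceptance condition is the key design point: a passing test and a test that fails for an unrelated reason are both rejected and sent back for refinement, which is what distinguishes this goal from regression test generation.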
