SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

Lianghong Guo, Yanlin Wang, Caihua Li, Wei Tao, Pengyu Yang, Jiachi Chen, Haoyu Song, Duyu Tang, Zibin Zheng

TL;DR

SWE-Factory automates the construction of GitHub issue-resolution benchmarks, replacing the labor-intensive steps of environment setup, grading, and validation. The pipeline combines SWE-Builder, a four-agent system with an evaluation-environment memory pool, with exit-code-based grading and automated fail2pass validation. In experiments on 671 issues across Python, Java, JavaScript, and TypeScript, SWE-Builder with GPT-4.1-mini constructed 269 valid instances at $0.045 per instance, while Gemini-2.5-flash achieved comparable performance at $0.024 per instance. Exit-code grading agreed with manual inspection with 100% accuracy, and fail2pass validation reached a precision of 0.92 and a recall of 1.00. The authors release the code and datasets openly to support scalable benchmark generation for LLM-based software engineering research.

Abstract

Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline that addresses these challenges by integrating three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction; it employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need to write custom parsers manually. Finally, we automate the fail2pass validation process using these reliable exit-code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.
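
To make the exit-code-based grading and fail2pass validation described in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. The function names (`run_command`, `grade`, `is_fail2pass`) are hypothetical, and the convention that exit code 0 means "pass" and any other code means "fail" is an assumption based on common test-runner behavior, not a detail confirmed by the text above.

```python
import subprocess
import sys
from typing import Sequence


def run_command(cmd: Sequence[str]) -> int:
    """Run a test command and return only its exit code (no log parsing)."""
    return subprocess.run(list(cmd), capture_output=True).returncode


def grade(exit_code: int) -> str:
    """Assumed convention: exit code 0 is a pass, anything else is a fail."""
    return "pass" if exit_code == 0 else "fail"


def is_fail2pass(exit_before_patch: int, exit_after_patch: int) -> bool:
    """True when the tests fail before the gold patch and pass after it."""
    return grade(exit_before_patch) == "fail" and grade(exit_after_patch) == "pass"


if __name__ == "__main__":
    # Stand-ins for real test runs: tiny Python processes with fixed exit codes.
    before = run_command([sys.executable, "-c", "raise SystemExit(1)"])
    after = run_command([sys.executable, "-c", "raise SystemExit(0)"])
    print(grade(before), grade(after), is_fail2pass(before, after))
```

A check of this shape needs no per-project log parser, which is the property the abstract highlights. Note that a non-zero exit code alone does not say why a run failed, so a sketch like this cannot tell a genuinely failing test from a run that crashed before any test executed.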

Paper Structure

This paper contains 21 sections, 5 figures, 5 tables, and 2 algorithms.

Figures (5)

  • Figure 1: Traditional pipeline of GitHub issue resolution data collection.
  • Figure 2: Framework overview of SWE-Builder.
  • Figure 3: Case study of exit-code-based grading.
  • Figure 4: Case study of error2pass phenomenon: python-attrs__attrs-830.
  • Figure 5: Import dependency of the test path on the gold patch in python-attrs__attrs-830.