SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel

TL;DR

The paper tackles the scarcity of real-world interactive software engineering data and the risk of data contamination in benchmarks by introducing SWE-rebench, a fully automated pipeline that harvests executable SWE tasks from GitHub and yields a large, verifiable dataset (over 21k tasks) and a decontaminated benchmark (294 tasks from 169 repos) with a private leaderboard. It demonstrates how continuous data collection and standardized evaluation can support reinforcement learning for SWE agents and enable fair cross-model comparisons, revealing contamination effects in older benchmarks. The work provides a scalable foundation for open-source SWE research, enabling robust training and transparent evaluation while outlining limitations and directions for future expansion to more languages and broader task coverage.

Abstract

LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt their behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects the evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use the continuous supply of fresh tasks collected using the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark with their results on SWE-bench Verified and show that the performance of some language models might be inflated due to contamination issues.

Paper Structure

This paper contains 35 sections, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Overview of the automated pipeline for collecting software engineering data.
  • Figure 2: Overlap of solved tasks across selected models.