Automated Benchmark Generation for Repository-Level Coding Tasks

Konstantinos Vergopoulos, Mark Niklas Müller, Martin Vechev

TL;DR

This work introduces SetUpAgent, an automated system that builds historically accurate repository-level execution environments to generate large, up-to-date code-generation benchmarks. It works in three separate phases (extraction, iterative improvement, and validation), deriving installation and test commands from repository context and standardizing evaluation on test-level results. The authors create SWA-Bench and SWEE-Bench to address SWE-Bench’s limitations, revealing distributional differences and contamination risks that affect code-agent performance. The new benchmarks are more diverse and more recent than SWE-Bench, underscoring the practical value of automated benchmark generation for robust code-agent research.
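
The three phases named above can be pictured as a small control loop. The following is a minimal sketch, not the authors' implementation: `propose` stands in for the LLM-driven steps, `execute` for running the commands in a sandbox, and `validate` for the final check. These three callables, the `Setup` type, and the `max_iters` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Setup:
    """Candidate environment description (hypothetical structure)."""
    install_cmds: List[str]
    test_cmds: List[str]


# Assumed signatures, for illustration only:
#   propose(context, previous_setup, error_message) -> Setup
#   execute(setup) -> error message, or None if the commands ran cleanly
#   validate(setup) -> True if the resulting test run is acceptable
Propose = Callable[[str, Optional[Setup], Optional[str]], Setup]
Execute = Callable[[Setup], Optional[str]]
Validate = Callable[[Setup], bool]


def build_environment(
    context: str,
    propose: Propose,
    execute: Execute,
    validate: Validate,
    max_iters: int = 3,
) -> Optional[Setup]:
    # Extraction: derive initial commands from repository context.
    setup = propose(context, None, None)
    for _ in range(max_iters):
        error = execute(setup)
        if error is None:
            # Validation: accept the setup only if the final check passes.
            return setup if validate(setup) else None
        # Iterative improvement: feed the execution error back in.
        setup = propose(context, setup, error)
    return None  # no working setup found within the iteration budget
```

Passing the model and the sandbox in as plain callables keeps the loop itself testable without network access or containers.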

Abstract

Code agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench. This benchmark challenges code agents to generate patches addressing GitHub issues given the full repository as context. The correctness of generated patches is then evaluated by executing a human-written test suite extracted from the repository after the issue's resolution. However, constructing benchmarks like SWE-Bench requires substantial manual effort to set up historically accurate execution environments for testing. Crucially, this severely limits the number of considered repositories, e.g., just 12 for SWE-Bench. Considering so few repositories, selected for their popularity, risks a distributional mismatch, i.e., the measured performance may not be representative of real-world scenarios, potentially misguiding development efforts. In this work, we address this challenge and introduce SetUpAgent, a fully automated system capable of historically accurate dependency setup, test execution, and result parsing. Using SetUpAgent, we generate two new datasets: (i) SWEE-Bench, an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) SWA-Bench, a benchmark focusing on applications rather than libraries. Comparing these datasets to SWE-Bench with respect to their characteristics and code agent performance, we find significant distributional differences, including lower issue description quality and detail level, higher fix complexity, and, most importantly, up to 40% lower agent success rates.
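
To make the evaluation step concrete, the sketch below shows one way a patch could be judged from test-level results. It assumes the criterion commonly associated with SWE-Bench: tests that the reference fix turns from failing to passing must pass, and tests that already passed must keep passing. The function name, the status string `"PASSED"`, and the test identifiers are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, Set


def is_resolved(
    results: Dict[str, str],
    fail_to_pass: Set[str],
    pass_to_pass: Set[str],
) -> bool:
    """Decide whether a candidate patch resolves an instance.

    results maps a test identifier to its status after applying the
    candidate patch. A test missing from results counts as not passed.
    """
    required = fail_to_pass | pass_to_pass
    return all(results.get(test_id) == "PASSED" for test_id in required)


# Hypothetical usage with made-up test identifiers.
outcome = is_resolved(
    {"tests/test_a.py::test_fix": "PASSED", "tests/test_a.py::test_old": "PASSED"},
    fail_to_pass={"tests/test_a.py::test_fix"},
    pass_to_pass={"tests/test_a.py::test_old"},
)
```

Working at the level of individual tests, rather than the exit code of the whole suite, lets unrelated failures be excluded from the verdict.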

Paper Structure

This paper contains 46 sections, 18 figures, 10 tables.

Figures (18)

  • Figure 1: Overview of SetUpAgent, in which icons distinguish LLM-driven steps from execution feedback.
  • Figure 2: Illustration of the extraction phase of SetUpAgent. Please see the prompts appendix for the full-length prompts.
  • Figure 3: Illustration of the iterative improvement phase of SetUpAgent, where the error message was obtained by executing the commands from the previous iteration.
  • Figure 4: Illustration of the first step in the validation phase.
  • Figure 5: PDFs (left and middle) and CDF (right) of PR creation dates (left), repository age at PR creation time (middle), and number of GitHub stars (right) for SWA, SWEE, and SWE-Bench.
  • ...and 13 more figures