SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Minh V. T. Thai, Tue Le, Dung Nguyen Manh, Huy Phan Nhat, Nghi D. Q. Bui

TL;DR

The paper introduces SWE-EVO, a long-horizon software-evolution benchmark that challenges coding agents to interpret release notes, plan multi-file changes across versioned codebases, and maintain regression-free functionality validated by large test suites. It constructs 48 realistic evolution tasks from seven Python projects, emphasizing sustained planning and cross-file coordination beyond single-issue fixes. Experimental results show a substantial gap between current agents and real-world evolution demands, with GPT-5 solving only about 20% of tasks even with PR/issue context, and reveal informative failure modes that differentiate model families. The study also introduces Fix Rate as a granular progress metric and provides a robust evaluation framework to guide future advances in autonomous software engineering.

Abstract

Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.
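The abstract names a resolution rate and a partial-progress metric but does not define either here. The sketch below is only an illustration of how an all-or-nothing score and a partial-credit score can differ on the same test outcomes. It is not the paper's definition of Fix Rate: the `InstanceResult` structure, the function names, and the rule that any regression yields zero partial credit are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceResult:
    """Hypothetical test outcomes for one task after applying an agent's patch."""
    fail_to_pass_total: int   # tests expected to go from failing to passing
    fail_to_pass_passed: int  # how many of those pass with the agent's patch
    pass_to_pass_total: int   # tests that passed before and must keep passing
    pass_to_pass_passed: int  # how many of those still pass


def is_resolved(r: InstanceResult) -> bool:
    # All-or-nothing: every target test passes and nothing regresses.
    return (r.fail_to_pass_passed == r.fail_to_pass_total
            and r.pass_to_pass_passed == r.pass_to_pass_total)


def partial_credit(r: InstanceResult) -> float:
    # Assumption for this sketch: a regression forfeits all credit;
    # otherwise credit is the fraction of target tests fixed.
    if r.pass_to_pass_passed < r.pass_to_pass_total:
        return 0.0
    if r.fail_to_pass_total == 0:
        return 0.0
    return r.fail_to_pass_passed / r.fail_to_pass_total


def resolution_rate(results: List[InstanceResult]) -> float:
    return sum(is_resolved(r) for r in results) / len(results)


def mean_partial_credit(results: List[InstanceResult]) -> float:
    return sum(partial_credit(r) for r in results) / len(results)


# Made-up outcomes: one full fix, one half fix, one fix with a regression.
results = [
    InstanceResult(10, 10, 100, 100),
    InstanceResult(10, 5, 100, 100),
    InstanceResult(10, 8, 100, 99),
]
print(resolution_rate(results), mean_partial_credit(results))
```

On these made-up outcomes only one of three instances counts as resolved, while the partial-credit average also reflects the half-finished second instance. That is the kind of distinction a fine-grained metric is meant to expose.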

Paper Structure

This paper contains 28 sections, 2 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Conceptual model of software evolution, depicting the iterative cycle from an existing system to a new system through change identification, impact analysis, and the evolution process.
  • Figure 1: Average and maximum numbers characterizing different attributes of a SWE-EVO task instance. Statistics are micro-averages calculated without grouping by repository.
  • Figure 2: Comparison of SWE-Bench and SWE-EVO. SWE-Bench tasks agents with resolving a single GitHub issue to produce one patch (pull request). In contrast, SWE-EVO requires agents to interpret release notes and implement comprehensive changes that resolve multiple PRs, develop new features, and fix multiple bugs to evolve the codebase to a new version.
  • Figure 3: Distribution of SWE-EVO tasks (in parentheses) across 7 open-source GitHub repositories, each of which contains the source code for a popular, widely downloaded PyPI package.
  • Figure 4: SWE-EVO is markedly more demanding than SWE-Bench, with longer issue descriptions, broader patch scope (more lines/files/functions edited), and heavier test suites (FAIL_TO_PASS and total) in both mean and max. These properties require agents capable of long-context reasoning, coordinated multi-file edits, and regression-safe fixes.
  • ...and 3 more figures