SWE-Next: Scalable Real-World Software Engineering Tasks for Agents

Jiarong Liang, Zhiheng Lyu, Zijie Liu, Xiangchao Chen, Ping Nie, Kai Zou, Wenhu Chen

Abstract

Executable software engineering data is valuable for training SWE agents, but scaling it remains difficult for two reasons: only a small fraction of real repository changes yield verifiable, high-signal task instances, and naively building repository-specific environments quickly becomes the dominant systems cost. We present SWE-Next, an execution-grounded framework for scalable SWE task and trajectory collection. On the data side, SWE-Next mines real merged pull requests, executes candidate base/merged commit pairs, and retains only those that produce strict test improvements without regressions, yielding self-verifying instances. It also applies strict submission gating so that collected trajectories remain evidence-driven rather than speculative. On the systems side, SWE-Next introduces reusable repo-quarter profiles, which share one environment across commits that are close in time while keeping each task run separate and reproducible. Using only 30 hours and 639 GB of environment storage, SWE-Next processes 3,971 seed repositories and 102,582 candidate commit pairs mined from real merged PRs to construct a dataset of 2,308 self-verifying instances. Experiments show that SWE-Next improves downstream pass@1 with fewer or comparable training trajectories, indicating that its gains come not from a stronger trajectory generator, but from higher-signal execution-grounded supervision and more efficient data collection.
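The retention rule described in the abstract (strict test improvement, no regressions) can be illustrated with a minimal sketch. The exact criterion used by SWE-Next is not given here, so the status labels, the function name `is_self_verifying`, and the treatment of missing tests as non-passing are all assumptions made for illustration.

```python
from typing import Dict

PASS = "pass"  # assumed status label; any other value counts as non-passing


def is_self_verifying(base: Dict[str, str], merged: Dict[str, str]) -> bool:
    """Hypothetical filter for one base/merged commit pair.

    `base` and `merged` map test names to statuses from executing the
    test suite at each commit. The pair is kept only if no test that
    passed at base stops passing at merged, and at least one test that
    did not pass at base passes at merged.
    """
    # Regression check: every test passing at base must still pass.
    for test, status in base.items():
        if status == PASS and merged.get(test) != PASS:
            return False

    # Strict improvement: at least one test newly passes at merged.
    return any(
        status == PASS and base.get(test) != PASS
        for test, status in merged.items()
    )
```

Under these assumptions, a pair in which a failing test starts passing while everything else is unchanged is retained, whereas a pair with any newly failing test, or with no change in outcomes, is dropped.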

Paper Structure (27 sections, 2 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: End-to-end overview of SWE-Next. Starting from real repositories and merged pull requests, we execute tests on base/merged commit pairs to retain verifiable instances, amortize environment setup via reusable quarter profiles, and package validated tasks for trajectory collection and post-training.
  • Figure 2: Quarter profile mechanism for reusable environments. We map each instance to a deterministic (repo, quarter) profile, build a shared quarter-env image that caches a virtual environment (without bundling repository source), and run per-commit repo checkouts in isolated workspaces while reusing the shared environment. When the quarter-env image build fails, we fall back to per-commit environment builds.
  • Figure 3: Domain composition of the 3,971 seeded repositories, inferred from GitHub repository topics and descriptions. The corpus spans diverse software domains.
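
The quarter profile mechanism in Figure 2 maps each instance to a deterministic (repo, quarter) profile and falls back to a per-commit build when the shared image fails to build. A minimal sketch of that mapping follows. The key format, the use of the commit timestamp in UTC, and the names `quarter_profile` and `choose_environment` are assumptions for illustration, not the authors' implementation.

```python
from datetime import datetime, timezone
from typing import Dict


def quarter_profile(repo: str, commit_time: datetime) -> str:
    """Map a commit to a deterministic (repo, quarter) profile key.

    Naive datetimes are treated as UTC (an assumption of this sketch).
    """
    if commit_time.tzinfo is None:
        commit_time = commit_time.replace(tzinfo=timezone.utc)
    t = commit_time.astimezone(timezone.utc)
    quarter = (t.month - 1) // 3 + 1  # months 1-3 -> Q1, ..., 10-12 -> Q4
    return f"{repo}@{t.year}Q{quarter}"


def choose_environment(
    profile_key: str, commit_sha: str, profile_built: Dict[str, bool]
) -> str:
    """Pick the shared quarter environment when its build succeeded.

    `profile_built` records, per profile key, whether the shared image
    built successfully. Otherwise fall back to a per-commit environment.
    The returned strings are hypothetical image tags.
    """
    if profile_built.get(profile_key, False):
        return f"quarter-env:{profile_key}"
    return f"commit-env:{commit_sha}"


# Two commits from the same repository and calendar quarter share a key.
key_a = quarter_profile("org/repo", datetime(2023, 4, 1, tzinfo=timezone.utc))
key_b = quarter_profile("org/repo", datetime(2023, 6, 30, tzinfo=timezone.utc))
```

Because the key depends only on the repository name and the calendar quarter, every commit in that window resolves to the same cached environment, while each checkout can still run in its own isolated workspace.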