BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen; Fanzhe Meng; Jiale Zhao; Minghao Li; Daixuan Cheng; Huatong Song; Jie Chen; Yuzhi Lin; Hui Chen; Xin Zhao; Ruihua Song; Chang Liu; Cheng Chen; Kai Jia; Ji-Rong Wen

TL;DR

This paper introduces BeyondSWE, a benchmark that broadens existing code-agent evaluations along two axes, resolution scope and knowledge scope, using 500 real-world instances across four distinct settings. It also develops SearchSWE, a framework that integrates deep search with coding abilities.

Abstract

Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.

Paper Structure (42 sections, 8 figures, 7 tables)

Introduction
Related Work
BeyondSWE
Benchmark Infrastructure
Benchmark Tasks
Cross-Repository Issue Resolution
Domain-Specific Issue Resolution
Dependency-Driven Migration
Document-to-Repository Generation
Evaluation Protocol Design
Quality Control and Human Verification
The SearchSWE Framework
Experiments
Experimental Setup
Evaluation Results with OpenHands
...and 27 more sections

Figures (8)

Figure 1: Overview of BeyondSWE. Our benchmark extends evaluation along two dimensions—knowledge scope and resolution scope: CrossRepo and DomainFix expand knowledge scope by requiring external software resources and domain expertise respectively; DepMigrate and Doc2Repo expand resolution scope from localized patches to codebase-wide transformations.
Figure 2: Automated environment construction pipeline for BeyondSWE. This pipeline consists of three stages: (1) candidate collection with tailored strategies for each task; (2) agent-based Docker construction where an LLM agent iteratively resolves dependencies within the container until all tests pass; (3) strict environmental inspection requiring consistent P2P/F2P behavior to ensure reproducibility.
Figure 3: Overview of the SearchSWE framework. Left: the agent solves coding tasks by iteratively accessing external resources (search, browser) and local context (Docker container), with a blocklist preventing cheating. Right: evaluation applies patches to a fresh container and runs P2P/F2P tests for verification.
Figure 3: Research fields and repositories included in the DomainFix task.
Figure 4: Average tool calls (turns) per instance.
...and 3 more figures
