BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen

TL;DR

This paper introduces BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. It also develops SearchSWE, a framework that integrates deep search with coding abilities.

Abstract

Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.

Paper Structure

This paper contains 42 sections, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Overview of BeyondSWE. Our benchmark extends evaluation along two dimensions—knowledge scope and resolution scope: CrossRepo and DomainFix expand knowledge scope by requiring external software resources and domain expertise respectively; DepMigrate and Doc2Repo expand resolution scope from localized patches to codebase-wide transformations.
  • Figure 2: Automated environment construction pipeline for BeyondSWE. This pipeline consists of three stages: (1) candidate collection with tailored strategies for each task; (2) agent-based Docker construction where an LLM agent iteratively resolves dependencies within the container until all tests pass; (3) strict environmental inspection requiring consistent P2P/F2P behavior to ensure reproducibility.
  • Figure 3: Overview of the SearchSWE framework. Left: the agent solves coding tasks by iteratively accessing external resources (search, browser) and local context (Docker container), with a blocklist preventing cheating. Right: evaluation applies patches to a fresh container and runs P2P/F2P tests for verification.
  • Figure 3: Research fields and repositories included in the DomainFix task.
  • Figure 4: Average tool calls (turns) per instance.
  • ...and 3 more figures
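
The P2P/F2P verification mentioned in the captions of Figure 2 and Figure 3 can be illustrated with a minimal sketch. It assumes the SWE-bench-style reading of the abbreviations (P2P = pass-to-pass, F2P = fail-to-pass), which the captions do not spell out. It also assumes an instance counts as resolved only when every listed test passes after patching. The names `InstanceSpec`, `is_resolved`, and `is_consistent_environment` are hypothetical and are not taken from the BeyondSWE or SearchSWE code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstanceSpec:
    """Hypothetical record of the tests attached to one benchmark instance."""
    fail_to_pass: List[str]  # F2P: assumed to fail before the fix and pass after it
    pass_to_pass: List[str]  # P2P: assumed to pass both before and after the fix


def is_resolved(spec: InstanceSpec, results: Dict[str, bool]) -> bool:
    """Return True only if every F2P and P2P test passed after patching.

    `results` maps a test id to whether it passed; a test that is missing
    from `results` is treated as failed.
    """
    required = spec.fail_to_pass + spec.pass_to_pass
    return all(results.get(test_id, False) for test_id in required)


def is_consistent_environment(
    spec: InstanceSpec,
    before: Dict[str, bool],
    after_reference: Dict[str, bool],
) -> bool:
    """Assumed form of the environment inspection: F2P tests fail and P2P
    tests pass before the reference fix, and all tests pass after it."""
    f2p_fail_before = all(not before.get(t, False) for t in spec.fail_to_pass)
    p2p_pass_before = all(before.get(t, False) for t in spec.pass_to_pass)
    return f2p_fail_before and p2p_pass_before and is_resolved(spec, after_reference)
```

Under these assumptions, a patch that makes the F2P tests pass but breaks a P2P test would not count as resolved, which matches the intent of running both test groups in a fresh container.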