
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents

Nikolai Ludwig, Wasi Uddin Ahmad, Somshubra Majumdar, Boris Ginsburg

Abstract

We introduce SWE-ZERO to SWE-HERO, a two-stage SFT recipe that achieves state-of-the-art results on SWE-bench by distilling open-weight frontier LLMs. Our pipeline replaces resource-heavy dependencies with an evolutionary refinement strategy: (1) SWE-ZERO utilizes large-scale, execution-free trajectories to master code semantics and repository-level reasoning, and (2) SWE-HERO applies targeted, execution-backed refinement that turns these semantic intuitions into rigorous engineering workflows. Our empirical results set a new benchmark for open-source models of comparable size. We release a dataset of 300k SWE-ZERO and 13k SWE-HERO trajectories distilled from Qwen3-Coder-480B, alongside a suite of agents based on the Qwen2.5-Coder series. Notably, SWE-HERO-32B achieves a 62.2% resolution rate on SWE-bench Verified. Furthermore, despite being trained exclusively on Python, our agents demonstrate robust zero-shot transferability on SWE-bench Multilingual, reaching 44.1% and confirming the paradigm's generalizability across diverse languages.

Paper Structure

This paper contains 38 sections, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Performance overview of various open-source foundational models and SWE agents on SWE-bench Verified. Our SWE-Hero agents establish a new frontier, outperforming same-scale competitors.
  • Figure 2: Comparative overview of Agentic Workflows in OpenHands. The bottom panel depicts the SWE-Zero setup, utilizing a restricted, execution-free environment to maximize data scalability. In contrast, the top panel illustrates the SWE-Hero configuration, which transitions to a standard, feedback-driven workflow grounded in physical execution.
  • Figure 3: Cumulative performance and resource efficiency on SWE-bench Verified. Tasks are ordered by SWE-Hero-32B turn count (complexity proxy). The primary axis (thick lines) shows the natural decline in resolve rate as complexity grows, while the secondary axis (faint lines) highlights the significantly lower turn costs of SWE-Zero models.
  • Figure 4: Scaling behavior of SWE-Zero-14B. The resolution rate on SWE-bench increases consistently as execution-free training samples scale from 4k to 150k.
  • Figure 5: Increasing inference-time compute improves performance on SWE-bench Verified using the open-source SWE-Lego-Verifier-8B [tao2026swe].
  • ...and 2 more figures