SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development

Yaxin Du, Yuzhu Cai, Yifan Zhou, Cheng Wang, Yu Qian, Xianghe Pang, Qian Liu, Yue Hu, Siheng Chen

TL;DR

SWE-Dev introduces a large-scale, repository-grounded dataset for autonomous end-to-end feature-driven software development, pairing each task with a runnable environment and executable unit tests to enable verifiable supervision. The work details a three-stage dataset construction that uses call-tree tracing to generate controlled, feature-level tasks across real-world codebases, along with PRD refinement to ensure actionable requirements. Through extensive experiments across base LLMs, reasoning models, multi-agent systems (MAS), and tool-augmented agents, SWE-Dev reveals substantial headroom on Hard tasks and demonstrates meaningful gains from supervised fine-tuning (SFT), reinforcement learning (RL), and MAS training with execution feedback. The dataset thus provides a practical platform for evaluating and advancing long-horizon, execution-aware AI systems for real-world software engineering.

Abstract

Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks. However, feature-driven development, a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world end-to-end feature-driven software development tasks. To ensure verifiable and diverse training, SWE-Dev provides every instance with a runnable environment and developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. We evaluated 17 base LLMs, 10 reasoning-focused LLMs, 10 multi-agent systems, and 8 tool-augmented LLM agents on SWE-Dev. Results show substantial headroom: the best single-turn model reaches only 22.51% Pass@1 on the Hard split, while OpenHands agents improve to 56.44% but still leave many tasks unsolved. Code is available at https://github.com/DorothyDUUU/SWE-Dev.

Paper Structure

This paper contains 54 sections, 1 equation, 30 figures, 11 tables.

Figures (30)

  • Figure 1: Comparison of Pass@1 on SWE-Dev across LLMs, reasoning-focused LLMs, multi-agent systems (MAS), and tool-augmented LLM agents (via OpenHands), evaluated on the Easy and Hard splits. Each method is shown with paired bars (light: Easy; dark: Hard), highlighting substantial headroom on the Hard split and the strong gains brought by agentic execution-feedback workflows. See the appendix for full results and full model names.
  • Figure 2: Overview of SWE-Dev, a software development dataset providing feature development tasks with feature description and codebase as input and test cases for evaluation. It is uniquely grounded in real-world repositories and paired with executable test suites, enabling reliable, functionally verifiable supervision. SWE-Dev is evaluated on 45 autonomous coding systems and supports advanced training paradigms like SFT, RL, and multi-agent training.
  • Figure 3: Overview of SWE-Dev dataset construction. Step 1: collect real-world repositories with passing test files in Dockerized environments. Step 2: trace test executions to construct function-level call trees linking test cases to invoked source code. Step 3: mask core functions while generating refined PRDs to create tasks. Each sample includes an incomplete repository, a natural language requirement, and executable test cases, enabling realistic, verifiable feature development.
  • Figure 4: Training data scaling for SFT of Qwen2.5-7B-Instruct on SWE-Dev. As data size increases, performance improves steadily under SFT.
  • Figure 5: Complexity analysis
  • ...and 25 more figures
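The call-tree tracing named in Step 2 of the Figure 3 caption can be illustrated with a small sketch. This is not the authors' tooling: it assumes Python projects and uses the standard-library `sys.setprofile` hook to record which functions a test invokes. The function `trace_call_tree` and the toy functions `area`, `volume`, and `test_volume` are hypothetical names introduced only for this example.

```python
import sys
from collections import defaultdict
from typing import Callable, Dict, Set


def trace_call_tree(test_fn: Callable[[], None]) -> Dict[str, Set[str]]:
    """Run test_fn and return {caller: {callees}} for Python-level calls."""
    edges: Dict[str, Set[str]] = defaultdict(set)
    stack = []  # names of the currently active Python frames

    def profiler(frame, event, arg):
        if event == "call":
            name = frame.f_code.co_name
            if stack:
                edges[stack[-1]].add(name)  # link the caller to this callee
            stack.append(name)
        elif event == "return" and stack:
            stack.pop()  # also fires when a frame exits via an exception
        # c_call / c_return events (built-in functions) are ignored

    sys.setprofile(profiler)
    try:
        test_fn()
    finally:
        sys.setprofile(None)  # always remove the hook
    return dict(edges)


# Toy "repository" and test, standing in for real source and test files.
def area(width, height):
    return width * height


def volume(width, height, depth):
    return area(width, height) * depth


def test_volume():
    assert volume(2, 3, 4) == 24


tree = trace_call_tree(test_volume)
```

In this toy case the recorded edges link `test_volume` to `volume` and `volume` to `area`. A pipeline of the kind the caption describes could then pick functions reachable from a test as candidates for masking, although the selection criteria SWE-Dev actually uses are not given in this section.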