Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios

Zhi Chen, Lingxiao Jiang

TL;DR

This paper evaluates 4,892 patches generated by 10 Software Development Agents on 500 real-world GitHub issues from SWE-Bench Verified to understand patch quality and broader code impacts. It analyzes patch patterns at the issue, file, function, and line levels, and assesses non-functional code quality (reliability, security, maintainability) using SonarQube. Most agent patches avoid introducing bugs or vulnerabilities and reduce code smells and duplication, although some agents slightly increase code complexity. A comparison of resolved and unresolved GitHub issues shows that resolved issues tend to involve simpler problem statements and smaller code changes, suggesting that decomposing complex tasks could improve agent effectiveness. The study also highlights limitations in SWE-Bench's test coverage given the diversity of agent patches, and it offers practical guidance for benchmark maintainers, AI developers, and users to improve the real-world applicability of agent-driven software development.

Abstract

In recent years, AI-based software engineering has progressed from pre-trained models to advanced agentic workflows, with Software Development Agents representing the next major leap. These agents, capable of reasoning, planning, and interacting with external environments, offer promising solutions to complex software engineering tasks. However, while much research has evaluated code generated by large language models (LLMs), comprehensive studies on agent-generated patches, particularly in real-world settings, are lacking. This study addresses that gap by evaluating 4,892 patches from 10 top-ranked agents on 500 real-world GitHub issues from SWE-Bench Verified, focusing on their impact on code quality. Our analysis shows no single agent dominated, with 170 issues unresolved, indicating room for improvement. Even for patches that passed unit tests and resolved issues, agents made different file and function modifications compared to the gold patches from repository developers, revealing limitations in the benchmark's test case coverage. Most agents maintained code reliability and security, avoiding new bugs or vulnerabilities; while some agents increased code complexity, many reduced code duplication and minimized code smells. Finally, agents performed better on simpler codebases, suggesting that breaking complex tasks into smaller sub-tasks could improve effectiveness. This study provides the first comprehensive evaluation of agent-generated patches on real-world GitHub issues, offering insights to advance AI-driven software development.

Paper Structure

This paper contains 40 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Overview: we evaluate agent-generated patches on SWE-Bench tasks, analyze their impact on the codebase, and compare resolved and unresolved GitHub issues to gain insights for improving agent performance.
  • Figure 2: Issue Resolution Breakdown: 330 Resolved, 170 Unresolved
  • Figure 3: Overlap of Resolved Issues Between Agents