Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance

Giovanni Pinna, Jingzhi Gong, David Williams, Federica Sarro

TL;DR

The study addresses the challenge of comparing AI coding agents under real-world workloads through a task-stratified, temporal analysis of 7,156 PRs from the AIDev dataset across five agents. It uses regression-based time trends, LOESS to capture nonlinearity, and stratified Chi-square/Fisher tests with Bonferroni correction to quantify interactions between agent, task type, and time. Task type emerges as the dominant driver of acceptance, with a $29$ percentage-point gap between task categories. OpenAI Codex delivers broad, high performance across tasks, while Claude Code and Cursor outperform it in specific tasks. Devin shows a consistent improvement over time, whereas the other agents plateau. The results underscore the importance of task context and temporal dynamics when evaluating AI coding agents and offer practical guidance for tool selection by task and workflow, while acknowledging limitations in causality and generalizability.
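
As a rough illustration of the stratified testing idea mentioned above, the following sketch runs one Pearson Chi-square test per task type on a 2x2 table (agent A vs. agent B, accepted vs. rejected) and applies a Bonferroni correction across the strata. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the two-agent setup, and the counts are hypothetical, and the Fisher exact fallback for small cells is omitted.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("table has an empty row or column")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def stratified_tests(tables, alpha=0.05):
    """Run one test per stratum and Bonferroni-adjust by the number of strata."""
    m = len(tables)
    results = {}
    for task, (a, b, c, d) in tables.items():
        stat, p_value = chi_square_2x2(a, b, c, d)
        p_adjusted = min(1.0, p_value * m)  # Bonferroni: multiply by number of tests
        results[task] = (stat, p_adjusted, p_adjusted < alpha)
    return results


# Hypothetical counts, NOT taken from the paper:
# (agent A accepted, agent A rejected, agent B accepted, agent B rejected)
example = {
    "docs": (90, 10, 70, 30),
    "fix": (50, 50, 52, 48),
}

results = stratified_tests(example)
for task, (stat, p_adjusted, significant) in results.items():
    print(f"{task}: chi2={stat:.2f}, adjusted p={p_adjusted:.4f}, significant={significant}")
```

Testing within each task type separately keeps a difference in task mix between agents from masquerading as a difference in agent quality. The Bonferroni step guards against false positives from running many tests.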

Abstract

The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across different task types and over time remain limited. This paper presents an empirical study comparing five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas other agents remain largely stable. Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features, a 16 percentage-point gap that exceeds typical inter-agent variance for most tasks. OpenAI Codex achieves consistently high acceptance rates across all nine task categories (59.6%-88.6%), with stratified Chi-square tests confirming statistically significant advantages over other agents in several task categories. However, no single agent performs best across all task types: Claude Code leads in documentation (92.3%) and features (72.6%), while Cursor excels in fix tasks (80.4%).
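
To make the per-week trend figure concrete, the sketch below fits an ordinary least-squares line to a weekly acceptance-rate series and reads off the slope. It assumes a simple linear fit on weekly aggregates; the helper name `ols_slope` and the synthetic series are hypothetical. The series is constructed to have a slope of 0.77 and is not data from the study.

```python
def ols_slope(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic weekly acceptance rates (in %), built to rise 0.77 per week for 32 weeks.
weeks = list(range(32))
rates = [50.0 + 0.77 * week for week in weeks]

slope, intercept = ols_slope(weeks, rates)
print(f"trend: {slope:+.2f} per week, starting near {intercept:.1f}%")
```

On real data the weekly rates would be noisy, so the slope would normally be reported with a confidence interval or significance test, which this sketch leaves out.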

Paper Structure (11 sections, 2 figures, 4 tables)

This paper contains 11 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: RQ1. Acceptance rate over time per agent.
  • Figure 2: RQ3. Acceptance rates (%) by agent and task type. "$\dagger$" indicates $<$20 PRs; "---" indicates zero PRs.