Agentic Refactoring: An Empirical Study of AI Coding Agents

Kosei Horikawa, Hao Li, Yutaro Kashiwa, Bram Adams, Hajimu Iida, Ahmed E. Hassan

TL;DR

The paper addresses the emergence of autonomous AI coding agents in refactoring and provides the first large-scale empirical analysis of agentic refactorings in real-world open-source Java projects. By leveraging the AIDev dataset, RefactoringMiner, and DesigniteJava, the authors quantify prevalence (refactoring is explicitly targeted in 26.1% of agentic commits), categorize refactoring types (predominantly low-level edits), analyze motivations (maintainability 52.5% and readability 28.1%), and assess impact on code quality (statistically significant but practically small improvements, especially for medium-level refactorings with a median Class LOC change of $\Delta = -15.25$). The results reveal a gap between agentic refactoring activity and substantial architectural impact, suggesting agents currently excel at localized consistency work but lag in high-level design transformations. The findings inform researchers, developers, and tool builders about current capabilities, limitations, and directions for advancing architecturally aware, autonomous refactoring with AI agents.

Abstract

Agentic coding tools, such as OpenAI Codex, Claude Code, and Cursor, are transforming the software engineering landscape. These AI-powered systems function as autonomous teammates capable of planning and executing complex development tasks. Agents have become active participants in refactoring, a cornerstone of sustainable software development aimed at improving internal code quality without altering observable behavior. Despite their increasing adoption, there is a critical lack of empirical understanding regarding how agentic refactoring is utilized in practice, how it compares to human-driven refactoring, and what impact it has on code quality. To address this empirical gap, we present a large-scale study of AI agent-generated refactorings in real-world open-source Java projects, analyzing 15,451 refactoring instances across 12,256 pull requests and 14,988 commits derived from the AIDev dataset. Our empirical analysis shows that refactoring is a common and intentional activity in this development paradigm, with agents explicitly targeting refactoring in 26.1% of commits. Analysis of refactoring types reveals that agentic efforts are dominated by low-level, consistency-oriented edits, such as Change Variable Type (11.8%), Rename Parameter (10.4%), and Rename Variable (8.5%), reflecting a preference for localized improvements over the high-level design changes common in human refactoring. Additionally, the motivations behind agentic refactoring focus overwhelmingly on internal quality concerns, led by maintainability (52.5%) and readability (28.1%). Furthermore, quantitative evaluation of code quality metrics shows that agentic refactoring yields small but statistically significant improvements in structural metrics, particularly for medium-level changes, reducing class size and complexity (e.g., Class LOC median $\Delta = -15.25$).
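
To make the three most frequent refactoring types named in the abstract concrete, the following is a minimal Java sketch of what Change Variable Type, Rename Parameter, and Rename Variable might look like on a single method. The class, the method names, and the logic are hypothetical and are not taken from the AIDev dataset or from the paper's Figure 1; the point is only that each edit is local and leaves observable behavior unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not code from the study's dataset.
class LowLevelRefactoringDemo {

    // Before: terse names and a local variable declared with a concrete type.
    static int sumPositiveBefore(List<Integer> v) {
        ArrayList<Integer> tmp = new ArrayList<>();
        for (Integer i : v) {
            if (i > 0) {
                tmp.add(i);
            }
        }
        int s = 0;
        for (Integer i : tmp) {
            s += i;
        }
        return s;
    }

    // After: the same logic with three low-level refactorings applied.
    //   Rename Parameter:     v   -> values
    //   Change Variable Type: ArrayList<Integer> -> List<Integer>
    //   Rename Variable:      tmp -> positives, s -> sum
    static int sumPositiveAfter(List<Integer> values) {
        List<Integer> positives = new ArrayList<>();
        for (Integer value : values) {
            if (value > 0) {
                positives.add(value);
            }
        }
        int sum = 0;
        for (Integer positive : positives) {
            sum += positive;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(3, -1, 4);
        // Both versions must agree, since refactoring preserves behavior.
        System.out.println(sumPositiveBefore(input) == sumPositiveAfter(input));
    }
}
```

None of these edits changes the method's structure or its place in the class design, which is the sense in which the abstract contrasts them with the high-level design changes more common in human refactoring.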

Paper Structure

This paper contains 45 sections, 1 equation, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Examples of agentic refactoring.
  • Figure 2: Overview of the study design.
  • Figure 3: Distribution of refactoring instances per refactoring commit (agentic vs. others).
  • Figure 4: Refactoring purpose comparison between agents and humans (normalized) [DBLP:journals/tse/KimZN14].
  • Figure 5: Smell Count Distribution (Before vs. After)