An Empirical Study on the Code Refactoring Capability of Large Language Models

Jonathan Cordeiro, Shayan Noei, Ying Zou

TL;DR

The paper evaluates StarCoder2 on automated code refactoring across 30 open-source Java projects and compares its results with those of human developers. Its empirical workflow draws on refactoring commits, code smells, code metrics, and EvoSuite-generated tests, and it investigates how prompt engineering (one-shot and chain-of-thought prompting) affects outcomes. StarCoder2 excels at reducing implementation-level smells and at improving cohesion and complexity metrics, while developers outperform it on complex, context-dependent design smells and in preserving unit-test success. The study offers practical guidance for integrating LLMs into refactoring workflows and provides a replication package for reproducibility.

Abstract

Large Language Models (LLMs) have shown potential to enhance software development through automated code generation and refactoring, reducing development time and improving code quality. This study empirically evaluates StarCoder2, an LLM optimized for code generation, in refactoring code across 30 open-source Java projects. We compare StarCoder2's performance against human developers, focusing on (1) code quality improvements, (2) types and effectiveness of refactorings, and (3) enhancements through one-shot and chain-of-thought prompting. Our results indicate that StarCoder2 reduces code smells by 20.1% more than developers, excelling in systematic issues like Long Statement and Magic Number, while developers handle complex, context-dependent issues better. One-shot prompting increases the unit test pass rate by 6.15% and improves code smell reduction by 3.52%. Generating five refactorings per input further increases the pass rate by 28.8%, suggesting that combining one-shot prompting with multiple refactorings optimizes performance. These findings provide insights into StarCoder2's potential and best practices for integrating LLMs into software refactoring, supporting more efficient and effective code improvement in real-world applications.
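
To make the two smell names in the abstract concrete, the following is a minimal Java sketch of what a Magic Number and a Long Statement typically look like, and of one behavior-preserving way to refactor them. The class, method, and constant names are hypothetical and chosen only for illustration; the example is not taken from the study's dataset or from StarCoder2's output.

```java
// Illustrative only: every name here is hypothetical, not from the paper.
final class OrderTotal {

    // Before: 0.08, 50.0, 4.99 and 0.5 are Magic Numbers (unexplained literals),
    // and the whole pricing rule is packed into a single Long Statement.
    static double before(double subtotal, int items) {
        return subtotal + (subtotal * 0.08) + (subtotal > 50.0 ? 0.0 : 4.99 + (items * 0.5));
    }

    // After: each literal becomes a named constant...
    private static final double TAX_RATE = 0.08;
    private static final double FREE_SHIPPING_THRESHOLD = 50.0;
    private static final double BASE_SHIPPING_FEE = 4.99;
    private static final double PER_ITEM_FEE = 0.5;

    // ...and the long expression is split into named intermediate steps.
    // The arithmetic and its evaluation order are unchanged.
    static double after(double subtotal, int items) {
        double tax = subtotal * TAX_RATE;
        double shipping = subtotal > FREE_SHIPPING_THRESHOLD
                ? 0.0
                : BASE_SHIPPING_FEE + items * PER_ITEM_FEE;
        return subtotal + tax + shipping;
    }
}
```

Because both versions compute the same expression, a unit test that passes on `before` should also pass on `after`, which is the kind of behavior preservation the unit test pass rate is meant to check.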

Paper Structure

This paper contains 32 sections, 2 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Overview of Our Approach for Data Collection and Answering Research Questions.
  • Figure 2: Zero-shot Prompt Used to Instruct StarCoder2 to Conduct Refactoring.
  • Figure 3: Distribution of Unit Test Pass Rates Across 30 Software Projects After StarCoder2-Generated Refactorings.
  • Figure 4: Distribution of Code Smells Across 30 Software Projects After LLM-Generated Refactorings.
  • Figure 5: Distribution of Code Smell Reduction Rates Across 30 Projects for StarCoder2-Generated Refactorings and Developer Refactorings.
  • ...and 1 more figure