Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs

Alessandro Midolo, Massimiliano Di Penta

TL;DR

The paper investigates whether GPT-4 can reliably refactor non-idiomatic Python code into Pythonic idioms. It replicates the study of Zhang et al. on a dataset of over one thousand methods and compares GPT-4's proposals to a curated baseline. GPT-4 generates substantially more refactorings while maintaining high correctness, and many of its suggestions match or exceed the benchmark, though some edge cases and specialized idioms still benefit from human insight. The study shows that LLMs can complement traditional static-analysis-based refactoring tools, enabling broader and more flexible code improvements. The findings support hybrid workflows that combine LLM capabilities with rule-based approaches for robust and scalable Python code modernization.

Abstract

In the Python ecosystem, the adoption of idiomatic constructs has been fostered by their expressiveness, which increases productivity and even efficiency, despite controversial arguments concerning familiarity or understandability. Recent research contributions have proposed approaches, based on static code analysis and transformation, to automatically identify and enact opportunities to refactor non-idiomatic code into idiomatic code. Given the potential recently offered by Large Language Models (LLMs) for code-related tasks, in this paper we present the results of a replication study in which we investigate GPT-4's effectiveness in recommending idiomatic refactoring actions. Our results reveal that GPT-4 not only identifies idiomatic constructs effectively but also frequently exceeds the benchmark, proposing refactoring actions where the existing baseline failed. A manual analysis of a random sample shows the correctness of the obtained recommendations. Our findings underscore the potential of LLMs to achieve tasks that, in the past, required implementing recommenders based on complex code analyses.

Paper Structure

This paper contains 15 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: An example of a refactoring generated by GPT-4, where the 'with' idiom is used to handle the opening of an MP4 file. Left: the original code extracted from the repository. Right: the refactoring proposed by GPT-4.
  • Figure 2: An example of a refactoring where GPT-4 proposes more refactoring actions than the benchmark (Zhang et al., 2024). The Pythonic idiom used is the truth value test. Left: the refactoring proposed by GPT-4. Right: the refactoring proposed in the benchmark.
  • Figure 3: An example of a refactoring where the benchmark (Zhang et al., 2024) proposes more refactoring actions than GPT-4. The Pythonic idiom used is the assignment of multiple targets. Left: the refactoring proposed by GPT-4. Right: the refactoring proposed in the benchmark.
  • Figure 4: An example of a refactoring where GPT-4 proposes no action, unlike the benchmark (Zhang et al., 2024). The Pythonic idiom used is the chain comparison. Left: the original code. Right: the refactoring proposed in the benchmark.
  • Figure 5: Two examples of incorrect refactorings, one from each approach. Left: the refactoring using the chain comparison idiom proposed by GPT-4. Right: the refactoring using the f-string idiom proposed by the benchmark (Zhang et al., 2024).
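
The idioms named in the figure captions can be illustrated with a minimal sketch. The function names and inputs below are hypothetical and are not taken from the paper's dataset or figures; each pair shows a non-idiomatic form followed by one plausible idiomatic rewrite.

```python
# Hypothetical before/after pairs for the idioms named in the captions.
# None of these functions come from the paper; they are illustrations only.

def read_first_line_plain(path):
    # Non-idiomatic: manual open/close with try/finally.
    f = open(path)
    try:
        return f.readline()
    finally:
        f.close()

def read_first_line_with(path):
    # 'with' idiom: the file is closed automatically on exit.
    with open(path) as f:
        return f.readline()

def is_empty_plain(items):
    # Non-idiomatic: explicit length comparison.
    return len(items) == 0

def is_empty_truth(items):
    # Truth value test: empty built-in containers are falsy.
    return not items

def init_plain():
    # Non-idiomatic: one assignment per line.
    x = 0
    y = 1
    return x, y

def init_multi():
    # Assignment of multiple targets in a single statement.
    x, y = 0, 1
    return x, y

def in_range_plain(low, value, high):
    # Non-idiomatic: two comparisons joined with 'and'.
    return low <= value and value < high

def in_range_chained(low, value, high):
    # Chain comparison: 'value' is evaluated only once.
    return low <= value < high

def greet_plain(name, count):
    # Non-idiomatic: printf-style formatting.
    return "Hello, %s! You have %d items." % (name, count)

def greet_fstring(name, count):
    # f-string idiom.
    return f"Hello, {name}! You have {count} items."
```

These rewrites are equivalent only under assumptions about the inputs. For example, `not items` and `len(items) == 0` diverge when `items` is `None` or an object with custom truthiness, and a chained comparison evaluates its middle operand once, which matters if that operand has side effects. Differences of this kind are one way an idiomatic refactoring can turn out to be incorrect.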