Table of Contents
Fetching ...

Refactoring Programs Using Large Language Models with Few-Shot Examples

Atsushi Shirafuji, Yusuke Oda, Jun Suzuki, Makoto Morishita, Yutaka Watanobe

TL;DR

This work addresses the challenge of improving code readability and maintainability without risking functionality by using GPT-3.5 to generate less complex Python programs. It introduces a few-shot prompting pipeline that selects the best refactoring examples for each problem, generates multiple candidates, and validates them with a sandboxed judge to ensure semantic correctness. Quantitatively, it achieves a high success rate (up to 95.68% pass@10) and notable reductions in cyclomatic complexity ($CC$) and lines of code ($LOC$), while qualitatively improving formatting but sometimes deleting or translating comments. The results highlight the potential of LLM-assisted code refactoring for education and software development, while also underscoring limitations in over-editing and comment handling that warrant further study and broader LLM exploration.

Abstract

A less complex and more straightforward program is a crucial factor that enhances its maintainability and makes writing secure and bug-free programs easier. However, due to its heavy workload and the risks of breaking the working programs, programmers are reluctant to do code refactoring, and thus, it also causes the loss of potential learning experiences. To mitigate this, we demonstrate the application of using a large language model (LLM), GPT-3.5, to suggest less complex versions of the user-written Python program, aiming to encourage users to learn how to write better programs. We propose a method to leverage the prompting with few-shot examples of the LLM by selecting the best-suited code refactoring examples for each target programming problem based on the prior evaluation of prompting with the one-shot example. The quantitative evaluation shows that 95.68% of programs can be refactored by generating 10 candidates each, resulting in a 17.35% reduction in the average cyclomatic complexity and a 25.84% decrease in the average number of lines after filtering only generated programs that are semantically correct. Furthermore, the qualitative evaluation shows outstanding capability in code formatting, while unnecessary behaviors such as deleting or translating comments are also observed.

Refactoring Programs Using Large Language Models with Few-Shot Examples

TL;DR

This work addresses the challenge of improving code readability and maintainability without risking functionality by using GPT-3.5 to generate less complex Python programs. It introduces a few-shot prompting pipeline that selects the best refactoring examples for each problem, generates multiple candidates, and validates them with a sandboxed judge to ensure semantic correctness. Quantitatively, it achieves a high success rate (up to 95.68% pass@10) and notable reductions in cyclomatic complexity () and lines of code (), while qualitatively improving formatting but sometimes deleting or translating comments. The results highlight the potential of LLM-assisted code refactoring for education and software development, while also underscoring limitations in over-editing and comment handling that warrant further study and broader LLM exploration.

Abstract

A less complex and more straightforward program is a crucial factor that enhances its maintainability and makes writing secure and bug-free programs easier. However, due to its heavy workload and the risks of breaking the working programs, programmers are reluctant to do code refactoring, and thus, it also causes the loss of potential learning experiences. To mitigate this, we demonstrate the application of using a large language model (LLM), GPT-3.5, to suggest less complex versions of the user-written Python program, aiming to encourage users to learn how to write better programs. We propose a method to leverage the prompting with few-shot examples of the LLM by selecting the best-suited code refactoring examples for each target programming problem based on the prior evaluation of prompting with the one-shot example. The quantitative evaluation shows that 95.68% of programs can be refactored by generating 10 candidates each, resulting in a 17.35% reduction in the average cyclomatic complexity and a 25.84% decrease in the average number of lines after filtering only generated programs that are semantically correct. Furthermore, the qualitative evaluation shows outstanding capability in code formatting, while unnecessary behaviors such as deleting or translating comments are also observed.
Paper Structure (41 sections, 6 equations, 10 figures, 6 tables)

This paper contains 41 sections, 6 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Example of code refactoring to improve a correct program with room for improvement.
  • Figure 2: Illustration of the proposed approach selecting the best-suited code refactoring examples for few-shot prompting for each programming problem based on the performance of one-shot prompting. Only the filtered programs are suggested to a user.
  • Figure 3: Illustration of prompting consisting of (1) a system instruction, (2) zero/one/few-shot examples, and (3) the user's input program. The conversation in blue is the code refactoring example.
  • Figure 4: The difference of pass@10 from zero-shot prompting for each prompting. Higher is better.
  • Figure 5: The difference of CC from zero-shot prompting for each prompting. Lower is better.
  • ...and 5 more figures