Code Refactoring with LLM: A Comprehensive Evaluation With Few-Shot Settings
Md. Raihan Tapader, Md. Mostafizer Rahman, Ariful Islam Shiplu, Md Faizul Ibne Amin, Yutaka Watanobe
TL;DR
The study addresses multilingual code refactoring with large language models by designing a fine-tuned, prompt-engineered, few-shot framework evaluated on C, C++, C#, Python, and Java. It introduces a multilingual dataset, ten specialized prompts, and a workflow that generates multiple refactorings per input, validated via tooling and human judgment. Key findings show Java achieving up to 99.99% correctness in 10-shot settings with high compilability, while Python yields the smallest cyclomatic complexity, and all languages maintain high functional accuracy across shot configurations. The results demonstrate that LLM-based refactoring can preserve semantics while achieving substantial code-size and complexity reductions, with language-specific patterns guiding prompt and sample design.
Abstract
In today's world, the focus of programmers has shifted from writing complex, error-prone code to prioritizing simple, clear, efficient, and sustainable code that makes programs easier to understand. Code refactoring plays a critical role in this transition by improving structural organization and optimizing performance. However, existing refactoring methods are limited in their ability to generalize across multiple programming languages and coding styles, as they often rely on manually crafted transformation rules. The objectives of this study are to (i) develop a Large Language Model (LLM)-based framework capable of performing accurate and efficient code refactoring across multiple languages (C, C++, C#, Python, Java), (ii) investigate the impact of prompt engineering (temperature and different shot settings) and instruction fine-tuning on refactoring effectiveness, and (iii) evaluate the quality improvements in refactored code (compilability, correctness, distance, similarity, and the number of lines, tokens, and characters, as well as cyclomatic complexity) through empirical metrics and human assessment. To accomplish these goals, we propose a fine-tuned prompt-engineering-based model combined with few-shot learning for multilingual code refactoring. Experimental results indicate that Java achieves the highest overall correctness, up to 99.99% in the 10-shot setting, records the highest average compilability of 94.78% compared to the original source code, and maintains high similarity (approx. 53-54%), demonstrating a strong balance between structural modifications and semantic preservation. Python exhibits the lowest structural distance across all shots (approx. 277-294) while achieving moderate similarity (approx. 44-48%), which indicates consistent and minimally disruptive refactoring.
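To make the size, distance, and similarity metrics named above more concrete, the following is a minimal sketch of how such values could be computed for one original/refactored pair. It is an illustration only, not the authors' implementation: the abstract does not define the metrics, so the choice of character-level Levenshtein distance, `difflib` ratio as the similarity percentage, whitespace-separated tokens, and the helper names `levenshtein`, `size_metrics`, and `compare` are all assumptions.

```python
import difflib


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (assumed stand-in for 'distance')."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def size_metrics(code: str) -> dict:
    """Count non-blank lines, whitespace-separated tokens, and characters."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    return {"lines": len(lines),
            "tokens": len(code.split()),  # assumption: naive tokenization
            "characters": len(code)}


def compare(original: str, refactored: str) -> dict:
    """Collect distance, similarity (in percent), and size metrics for a pair."""
    ratio = difflib.SequenceMatcher(None, original, refactored).ratio()
    return {"distance": levenshtein(original, refactored),
            "similarity_percent": round(ratio * 100, 2),
            "original": size_metrics(original),
            "refactored": size_metrics(refactored)}


if __name__ == "__main__":
    before = "total = 0\nfor x in values:\n    total = total + x\n"
    after = "total = sum(values)\n"
    print(compare(before, after))
```

Compilability, correctness, and cyclomatic complexity are left out of this sketch because they need a compiler, test cases, and a language-aware parser respectively.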
