RefModel: Detecting Refactorings using Foundation Models

Pedro Simões, Rohit Gheyi, Rian Melo, Jonhnanthan Oliveira, Márcio Ribeiro, Wesley K. G. Assunção

TL;DR

This work investigates RefModel, a tool that leverages foundation models to detect code refactorings, addressing the limitations of rule-based and static-analysis approaches. Evaluations on synthetic Java transformations and real-world refactorings show that large models like Claude 3.5 Sonnet and Gemini 2.5 Pro provide high recall and precision, often matching or surpassing traditional detectors such as RefactoringMiner, RefDiff, and ReExtractor+. The study reports encouraging cross-language generalization (to Python and Go) and highlights two practical benefits: the models give natural language explanations and need only a single-sentence definition for each refactoring type. Prompt design and retrieval-augmented strategies emerge as promising directions to enhance accuracy on larger codebases, though they involve trade-offs in runtime and cost. Overall, foundation models offer a flexible, language-agnostic approach to refactoring detection with substantial practical impact for software maintenance and evolution.

Abstract

Refactoring is a common software engineering practice that improves code quality without altering program behavior. Although tools like ReExtractor+, RefactoringMiner, and RefDiff have been developed to detect refactorings automatically, they rely on complex rule definitions and static analysis, making them difficult to extend and generalize to other programming languages. In this paper, we investigate the viability of using foundation models for refactoring detection, implemented in a tool named RefModel. We evaluate Phi4-14B and Claude 3.5 Sonnet on a dataset of 858 single-operation transformations applied to artificially generated Java programs, covering widely used refactoring types. We also extend our evaluation by including Gemini 2.5 Pro and o4-mini-high, assessing their performance on 44 real-world refactorings extracted from four open-source projects. These models are compared against RefactoringMiner, RefDiff, and ReExtractor+. RefModel is competitive with, and in some cases outperforms, traditional tools. In real-world settings, Claude 3.5 Sonnet and Gemini 2.5 Pro jointly identified 97% of all refactorings, surpassing the best-performing static-analysis-based tools. The models showed encouraging generalization to Python and Golang. They provide natural language explanations and require only a single sentence to define each refactoring type.
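To make the "single sentence per refactoring type" idea concrete, the following is a minimal sketch of how a detection prompt might be assembled from one-sentence definitions and two program versions. It is not RefModel's actual prompt: the definition wording, the prompt layout, and the names `REFACTORING_DEFINITIONS` and `build_detection_prompt` are all assumptions made for illustration, and no model is called.

```python
# Hypothetical one-sentence definitions; the wording is illustrative only
# and is not taken from the paper.
REFACTORING_DEFINITIONS = {
    "Extract Method": "A fragment of a method body is moved into a new method that the original method calls.",
    "Rename Method": "A method's name changes while its parameters and body stay the same.",
    "Inline Method": "A method's body replaces its call sites and the method is removed.",
}


def build_detection_prompt(before: str, after: str, definitions: dict[str, str]) -> str:
    """Assemble a plain-text prompt asking a model which refactorings occurred.

    The layout is an assumption: definitions first, then both versions,
    then the question.
    """
    definition_lines = [f"- {name}: {text}" for name, text in definitions.items()]
    return "\n".join([
        "You are given two versions of a program.",
        "Refactoring types to consider:",
        *definition_lines,
        "",
        "Before:",
        before.strip(),
        "",
        "After:",
        after.strip(),
        "",
        "List each refactoring that was applied, with a one-sentence explanation.",
    ])
```

Under this sketch, supporting a new refactoring type means adding one dictionary entry rather than writing a new detection rule, and the resulting string can be passed to whichever model client is in use.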

Paper Structure

This paper contains 23 sections, 5 tables.