MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration

Yisen Xu, Feng Lin, Jinqiu Yang, Tse-Hsun Chen, Nikolaos Tsantalis

TL;DR

This paper addresses the labor-intensive task of method-level code refactoring by introducing MANTRA, an end-to-end LLM-agent framework that combines context-aware retrieval-augmented generation, multi-agent collaboration, and self-repair based on verbal reinforcement learning. MANTRA constructs a contextual RAG database of pure refactorings, orchestrates a developer-reviewer-repair loop to generate and validate compilable, test-passing refactorings, and uses a reflexive repair strategy to fix remaining issues. Empirical evaluation on 703 pure refactorings across 10 Java projects shows MANTRA achieving an 82.8% success rate in producing compilable and test-passing refactorings, compared with 8.7% for the RawGPT baseline, and outperforming IntelliJ’s EM-Assist on Extract Method tasks. A user study with 37 developers indicates that MANTRA-generated refactorings can be as readable and reusable as human-written code, with advantages for certain refactoring types. These results underscore the practical value of integrating LLMs with traditional software engineering tools for automated maintenance.

Abstract

Maintaining and scaling software systems relies heavily on effective code refactoring, yet this process remains labor-intensive, requiring developers to carefully analyze existing codebases and prevent the introduction of new defects. Although recent advancements have leveraged Large Language Models (LLMs) to automate refactoring tasks, current solutions are constrained in scope and lack mechanisms to guarantee code compilability and successful test execution. In this work, we introduce MANTRA, a comprehensive LLM agent-based framework that automates method-level refactoring. MANTRA integrates Context-Aware Retrieval-Augmented Generation, coordinated Multi-Agent Collaboration, and Verbal Reinforcement Learning to emulate human decision-making during refactoring while preserving code correctness and readability. Our empirical study, conducted on 703 instances of "pure refactorings" (i.e., code changes exclusively involving structural improvements) drawn from 10 representative Java projects, covers the six most prevalent refactoring operations. Experimental results demonstrate that MANTRA substantially surpasses a baseline LLM (RawGPT), achieving an 82.8% success rate (582/703) in producing code that compiles and passes all tests, compared to just 8.7% (61/703) with RawGPT. Moreover, in comparison to IntelliJ's LLM-powered refactoring tool (EM-Assist), MANTRA exhibits a 50% improvement in generating Extract Method transformations. A usability study involving 37 professional developers further shows that refactorings performed by MANTRA are perceived to be as readable and reusable as human-written code and, in certain cases, are rated even more favorably. These results highlight the practical advantages of MANTRA and emphasize the growing potential of LLM-based systems in advancing the automation of software refactoring tasks.
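
The generate, validate, and repair cycle summarized above can be pictured as a bounded loop in which textual feedback from each failed attempt is fed into the next attempt. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `refactor_with_repair`, `Review`, `fake_generate`, and `fake_review` are hypothetical, and the two callables stand in for an LLM call and for compilation plus test execution.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Review:
    """Outcome of checking one candidate refactoring (hypothetical type)."""
    ok: bool
    feedback: str = ""


def refactor_with_repair(
    source: str,
    generate: Callable[[str, List[str]], str],
    review: Callable[[str], Review],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate the reviewer accepts, or None.

    Feedback from each rejected round is kept as plain text and passed to
    the next generation call; no model weights are updated.
    """
    reflections: List[str] = []
    for _ in range(max_rounds):
        candidate = generate(source, reflections)
        verdict = review(candidate)
        if verdict.ok:
            return candidate
        reflections.append(verdict.feedback)
    return None


# Toy stand-ins (assumptions) for the LLM and for the compile/test check.
def fake_generate(source: str, reflections: List[str]) -> str:
    # Produces a passing candidate only after seeing reviewer feedback.
    return source + " // fixed" if reflections else source + " // broken"


def fake_review(candidate: str) -> Review:
    if candidate.endswith("// fixed"):
        return Review(ok=True)
    return Review(ok=False, feedback="compilation failed")


result = refactor_with_repair("void m() {}", fake_generate, fake_review)
```

In this toy run the first candidate is rejected, its feedback is recorded, and the second candidate is accepted. Capping the number of rounds keeps the loop from running indefinitely when no candidate passes.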

Paper Structure

This paper contains 10 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: An overview of how MANTRA constructs a database containing only pure refactorings for RAG.
  • Figure 2: An overview of MANTRA.
  • Figure 3: Left panel: Boxplots depicting the readability and reusability scores from the questionnaire, comparing MANTRA-generated code with human-written code. White markers indicate the mean score for each refactoring category. Right panel: A visualization of participants’ preferences regarding which code they favor.
  • Figure 4: An example illustrating how MANTRA and human developers implemented the Extract & Move refactoring.
  • Figure 5: Contribution of each component in MANTRA. Compile&Test Success is the number of generated code instances that compile and pass all tests. Successful refactorings is the number of verified code instances that contain the specific refactoring.