Agentic LMs: Hunting Down Test Smells

Rian Melo, Pedro Simões, Rohit Gheyi, Marcelo d'Amorim, Márcio Ribeiro, Gustavo Soares, Eduardo Almeida, Elvys Soares

TL;DR

This paper investigates agentic workflows that coordinate small open language models to detect and refactor test smells in real-world Java tests, aiming for cost-effective automation and cross-language generalization. Using 150 smell instances across five smell types drawn from 11 open-source projects, the study reports near-perfect smell detection and identifies Phi-4-14B as the strongest refactorer, with a pass@5 of 75.3% and four-agent performance within 5% of single-agent proprietary LLMs. Multi-agent configurations outperform single-agent setups in three of the five smell types, and six pull requests containing Phi-4-14B-generated code were merged into open-source projects, supporting practical applicability. The authors report that the approach generalizes to Python, Go, and JavaScript, and suggest that extensible, prompt-driven agentic workflows can meaningfully advance automated test maintenance.

Abstract

Test smells reduce test suite reliability and complicate maintenance. While many methods detect test smells, few support automated removal, and most rely on static analysis or machine learning. This study evaluates four models with relatively small parameter counts (Llama-3.2-3B, Gemma-2-9B, DeepSeek-R1-14B, and Phi-4-14B) for their ability to detect and refactor test smells using agent-based workflows. We assess workflows with one, two, and four agents over 150 instances of five common smells from real-world Java projects. Our approach generalizes to Python, Golang, and JavaScript. All models detected nearly all instances, with Phi-4-14B achieving the best refactoring accuracy (pass@5 of 75.3%). Phi-4-14B with four agents performed within 5% of proprietary LLMs used in a single-agent setup. Multi-agent setups outperformed single-agent ones in three of five smell types, though for Assertion Roulette, one agent sufficed. We submitted pull requests with Phi-4-14B-generated code to open-source projects, and six were merged.
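
To make the pass@5 figure concrete, here is a minimal Python sketch of one common empirical reading of pass@k: an instance counts as solved if at least one of its first k refactoring attempts is accepted. This is an assumption for illustration; the abstract does not state how the authors compute pass@k, and the function name `pass_at_k` and the sample outcomes below are hypothetical.

```python
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of instances with at least one passing attempt among the first k.

    `results[i][j]` is True when attempt j on smell instance i produced an
    accepted refactoring. This empirical definition is an assumption; the
    source does not specify the exact estimator.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not results:
        raise ValueError("results must not be empty")
    solved = sum(1 for attempts in results if any(attempts[:k]))
    return solved / len(results)


# Hypothetical outcomes for four smell instances, five attempts each.
example = [
    [False, True, False, False, False],   # solved on the second attempt
    [False, False, False, False, False],  # never solved
    [True, True, True, True, True],       # solved on the first attempt
    [False, False, False, False, True],   # solved on the fifth attempt
]

if __name__ == "__main__":
    print(f"pass@1 = {pass_at_k(example, 1):.2f}")
    print(f"pass@5 = {pass_at_k(example, 5):.2f}")
```

Under this reading, and assuming the metric is taken over all 150 instances, a pass@5 of 75.3% would correspond to roughly 113 instances with at least one accepted refactoring in five attempts.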

Paper Structure

This paper contains 18 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Agentic workflow with one, two and four agents to detect and remove test smells.