LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework
Andrea Lops, Fedelucio Narducci, Azzurra Ragone, Michelantonio Trizio, Claudio Bartolini
TL;DR
AgoneTest is an automated end-to-end framework for evaluating LLM-generated Java unit tests at the class level. It is accompanied by the Classes2Test dataset, which enables realistic benchmarking across multiple LLMs and prompts. The framework combines a configurable prompt-engineering pipeline with automated test generation, integration, and quality metrics (code coverage, mutation score, and test smells) to compare LLM-generated tests against human-written ones. Key findings show that, among tests that compile, LLM-generated suites can match or surpass human-written tests, and that prompt engineering (notably few-shot and contextual prompts) improves quality and compilation success when prompts are enhanced with explicit class-path information. The framework serves as a reusable benchmark for model design and prompt strategies, and points to future work on broader language support, higher compilation rates, and richer evaluation protocols.
Abstract
Unit testing is an essential but resource-intensive step in software development that ensures individual code units function correctly. This paper introduces AgoneTest, an automated framework for evaluating Java unit tests generated by Large Language Models (LLMs). AgoneTest does not propose a novel test generation algorithm; rather, it supports researchers and developers in comparing different LLMs and prompting strategies through a standardized end-to-end evaluation pipeline under realistic conditions. We introduce the Classes2Test dataset, which maps Java classes under test to their corresponding test classes, and a framework that integrates advanced evaluation metrics, such as mutation score and test smells, for a comprehensive assessment. Experimental results show that, for the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection. Our findings also demonstrate that enhanced prompting strategies improve test quality. AgoneTest clarifies the potential of LLMs in software testing and offers insights for future improvements in model design, prompt engineering, and testing practices.
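To make the two numeric metrics named above concrete, the following is a minimal sketch of how line coverage and mutation score are commonly defined: covered lines over total lines, and killed mutants over generated mutants, each as a percentage. The class name `TestQualityMetrics`, its methods, and the choice to return 0 for an empty denominator are illustrative assumptions, not part of AgoneTest, and the sketch does not reflect how the framework's underlying tools handle edge cases such as equivalent mutants.

```java
/**
 * Illustrative helper (hypothetical, not AgoneTest code) that computes
 * two common test-quality percentages from raw counts.
 */
public final class TestQualityMetrics {

    private TestQualityMetrics() {
        // Utility class: no instances.
    }

    /** Percentage of executable lines exercised by the test suite. */
    public static double lineCoverage(int coveredLines, int totalLines) {
        return percentage(coveredLines, totalLines);
    }

    /** Percentage of generated mutants detected (killed) by the test suite. */
    public static double mutationScore(int killedMutants, int totalMutants) {
        return percentage(killedMutants, totalMutants);
    }

    private static double percentage(int part, int whole) {
        if (part < 0 || whole < 0 || part > whole) {
            throw new IllegalArgumentException("expected 0 <= part <= whole");
        }
        // Assumption made for this sketch: an empty denominator yields 0, not NaN.
        return whole == 0 ? 0.0 : 100.0 * part / whole;
    }

    public static void main(String[] args) {
        // Made-up counts, used only to show the calls.
        System.out.println("line coverage:  " + lineCoverage(45, 60));
        System.out.println("mutation score: " + mutationScore(18, 24));
    }
}
```

With these definitions, a generated suite and a human-written suite for the same class can be compared on the same scale. Test smells are a qualitative count and are not covered by this sketch.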
