Test smells in LLM-Generated Unit Tests

Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Xunzhu Tang, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé

TL;DR

This work investigates the quality of unit tests generated by large language models (LLMs) with a focus on test smells that affect readability and maintainability. It conducts a large-scale, multi-benchmark study across four LLMs, EvoSuite, and human-written tests, using two detectors (TsDetect and JNose) to analyze class-level and method-level generation. Key findings show pervasive smells such as Assertion Roulette and issues related to verbosity and missing exception handling in LLM-generated tests, with smell patterns strongly shaped by prompting strategies, context length, and model scale; EvoSuite exhibits different, more template-driven flaws. The study also reveals partial overlap with human-written tests, raising concerns about data leakage and memorization, and highlights the need for smell-aware generation, robust detection tools, and hybrid workflows that combine generation with targeted refinement. Together, these results advance understanding of LLM-based test generation and offer practical guidance for researchers and practitioners aiming to improve the maintainability and reliability of AI-assisted testing pipelines.

Abstract

LLMs promise to transform unit test generation from a manual burden into an automated solution. Yet, beyond metrics such as compilability or coverage, little is known about the quality of LLM-generated tests, particularly their susceptibility to test smells: design flaws that undermine readability and maintainability. This paper presents the first multi-benchmark, large-scale analysis of test smell diffusion in LLM-generated unit tests. We contrast LLM outputs with human-written suites (as the reference for real-world practices) and SBST-generated tests from EvoSuite (as the automated baseline), disentangling whether LLMs reproduce human-like flaws or artifacts of synthetic generation. Our study draws on 20,505 class-level suites from four LLMs (GPT-3.5, GPT-4, Mistral 7B, Mixtral 8x7B), 972 method-level cases from TestBench, 14,469 EvoSuite tests, and 779,585 human-written tests from 34,635 open-source Java projects. Using two complementary detection tools (TsDetect and JNose), we analyze prevalence, co-occurrence, and correlations with software attributes and generation parameters. Results show that LLM-generated tests consistently manifest smells such as Assertion Roulette and Magic Number Test, with patterns strongly influenced by prompting strategy, context length, and model scale. Comparisons reveal overlaps with human-written tests, raising concerns about potential data leakage from training corpora, while EvoSuite exhibits distinct, generator-specific flaws. These findings highlight both the promise and the risks of LLM-based test generation, and call for the design of smell-aware generation frameworks, prompt engineering strategies, and enhanced detection tools to ensure maintainable, high-quality test code.

Paper Structure

This paper contains 35 sections, 2 equations, 8 figures, and 19 tables.

Figures (8)

  • Figure 1: Overview of the pipeline used to design the analysis.
  • Figure 2: Test Smell Co-occurrence Matrices detected by TsDetect and JNose for Benchmark 1 (LLMs and EvoSuite).
  • Figure 3: Test Smell Co-occurrence Matrices for Benchmark 2 detected by TsDetect and JNose.
  • Figure 4: Spearman Correlation results between Project and LLM Characteristics and Test Smell Presence.
  • Figure 5: Pearson product-moment correlation coefficient (PMCC) results between Project and LLM Characteristics and Test Smell Presence in Benchmark 1.
  • ...and 3 more figures