Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation

Wendkûuni C. Ouédraogo; Kader Kaboré; Yinghua Li; Haoye Tian; Anil Koyuncu; Jacques Klein; David Lo; Tegawendé F. Bissyandé

TL;DR

This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the full class level, analyzing four models against EvoSuite across 216,300 test cases and suggesting that hybrid approaches combining LLM-based generation with automated validation and search-based refinement are necessary for production-ready results.

Abstract

Unit testing is essential for software reliability, yet manual test creation is time-consuming and often neglected. Search-based software testing (SBST) improves efficiency but produces tests with poor readability and maintainability, while LLMs show promise but lack comprehensive evaluation across reasoning-based prompting and real-world scenarios. This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the full class level, analyzing four models (GPT-3.5, GPT-4, Mistral 7B, and Mixtral 8x7B) against EvoSuite across 216,300 test cases targeting Defects4J, SF110, and CMD. We evaluate five prompting techniques: Zero-Shot Learning (ZSL), Few-Shot Learning (FSL), Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Guided Tree-of-Thought (GToT). For each, we assess compilability, hallucination-driven failures, readability, coverage, and test smells. Reasoning-based prompting, particularly GToT, significantly enhances reliability and compilability, yet hallucination-driven failures remain persistent, with compilation failure rates reaching 86%. While LLM-generated tests are generally more readable than SBST outputs, recurring test smells such as Magic Number Test and Assertion Roulette hinder maintainability. These findings suggest that hybrid approaches combining LLM-based generation with automated validation and search-based refinement are necessary for production-ready results.

Paper Structure (57 sections, 9 figures, 45 tables)

Introduction
Background
Prompt Engineering
Match Success Rate (MSR) and Code Extraction Success Rate (CSR)
Rationale.
Definitions (adapted to test generation).
Detectors and patterns.
Structural validation for CSR.
Illustrative example (Divider).
What MSR/CSR do not measure.
Synthetic corner cases.
Scope note.
Readability vs. Understandability
Test Smells
Static Analysis Tools: Checkstyle, PMD, and SpotBugs
...and 42 more sections

Figures (9)

Figure 1: A Sample Use of ChatGPT in Unit test Suite generation
Figure 2: Overview of the pipeline used to design the experiments.
Figure 5: Heatmap of compilation errors by model and prompt engineering.
Figure 6: Heatmap of the Top-5 Google and Sun Code Style Violations Across Prompt Engineering
Figure : (a) Chain-of-Thought Prompting
...and 4 more figures
