Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation

Wendkûuni C. Ouédraogo, Kader Kaboré, Yinghua Li, Haoye Tian, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé

TL;DR

This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the full class level, analyzing four models against EvoSuite across 216,300 test cases and suggesting that hybrid approaches combining LLM-based generation with automated validation and search-based refinement are necessary for production-ready results.

Abstract

Unit testing is essential for software reliability, yet manual test creation is time-consuming and often neglected. Search-based software testing (SBST) improves efficiency but produces tests with poor readability and maintainability, while LLMs show promise but lack comprehensive evaluation across reasoning-based prompting and real-world scenarios. This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the full class level, analyzing four models (GPT-3.5, GPT-4, Mistral 7B, and Mixtral 8x7B) against EvoSuite across 216,300 test cases targeting Defects4J, SF110, and CMD. We evaluate five prompting techniques, namely Zero-Shot Learning (ZSL), Few-Shot Learning (FSL), Chain-of-Thought (CoT), Tree-of-Thoughts (ToT), and Guided Tree-of-Thoughts (GToT), assessing compilability, hallucination-driven failures, readability, coverage, and test smells. Reasoning-based prompting, particularly GToT, significantly enhances reliability and compilability, yet hallucination-driven failures remain persistent, with compilation failure rates reaching 86%. While LLM-generated tests are generally more readable than SBST outputs, recurring issues such as Magic Number Tests and Assertion Roulette hinder maintainability. These findings suggest that hybrid approaches combining LLM-based generation with automated validation and search-based refinement are necessary for production-ready results.

Paper Structure (57 sections, 9 figures, 45 tables)

Figures (9)

  • Figure 1: A Sample Use of ChatGPT in Unit test Suite generation
  • Figure 2: Overview of the pipeline used to design the experiments.
  • Figure 5: Heatmap of compilation errors by model and prompt engineering.
  • Figure 6: Heatmap of the Top-5 Google and Sun Code Style Violations Across Prompt Engineering
  • Figure : (a) Chain-of-Thought Prompting
  • ...and 4 more figures