Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation

Azat Abdullin, Pouria Derakhshanfar, Annibale Panichella

TL;DR

The paper asks how LLM-based automatic unit test generation compares with traditional SBST and symbolic execution, while addressing concerns about data leakage and statistical rigor. It runs a leakage-free, large-scale evaluation of EvoSuite, Kex, and TestSpark on the GitBug Java benchmark, with ten seeds per test case and a uniform 120-second budget, and analyzes both execution-based metrics (compilation, coverage, mutation, fault reproduction) and code-feature metrics. Key contributions include an open, extensible test-generation assessment pipeline, a comprehensive cross-method comparison, and a co-factor analysis linking tool performance to code complexity, dependencies, and size. The findings show that LLM-based approaches are promising but generally underperform SBST and symbolic execution in coverage and fault detection, though they excel in mutation score; ChatGPT-4o emerges as the most viable LLM among those tested. These results highlight the potential for hybrid, multi-approach test generation that exploits complementary strengths.

Abstract

Generating tests automatically is a key and ongoing area of focus in software engineering research. The emergence of Large Language Models (LLMs) has opened up new opportunities, given their ability to perform a wide spectrum of tasks. However, the effectiveness of LLM-based approaches compared to traditional techniques such as search-based software testing (SBST) and symbolic execution remains uncertain. In this paper, we perform an extensive study of automatic test generation approaches based on three tools: EvoSuite for SBST, Kex for symbolic execution, and TestSpark for LLM-based test generation. We evaluate the tools' performance on the GitBug Java dataset and compare them using various execution-based and feature-based metrics. Our results show that while LLM-based test generation is promising, it falls behind traditional methods in terms of coverage. However, it significantly outperforms them in mutation scores, suggesting that LLMs provide a deeper semantic understanding of code. The LLM-based approach also performed worse than SBST and symbolic execution-based approaches w.r.t. fault detection capabilities. Additionally, our feature-based analysis shows that all tools are primarily affected by the complexity and internal dependencies of the class under test (CUT), with LLM-based approaches being especially sensitive to the CUT size.
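
To make the idea of comparing tools over repeated seeded runs more concrete, the sketch below computes an effect size between two samples of a metric such as coverage. It assumes the Vargha-Delaney A12 statistic, which is common in SBST studies; the text above does not name the statistic the paper uses, so this is an illustrative choice rather than the authors' method. The function name `a12` and all numbers are hypothetical, and the significance test that would accompany it is omitted.

```python
from itertools import product


def a12(xs, ys):
    """Vargha-Delaney A12 effect size (assumed statistic, not confirmed by the source).

    Returns the probability that a value drawn from xs is larger than a value
    drawn from ys, counting ties as half. 0.5 means no difference.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    wins = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x, y in product(xs, ys)
    )
    return wins / (len(xs) * len(ys))


# Hypothetical coverage values, one per seed, for a single class under test.
tool_a = [0.82, 0.79, 0.85, 0.80, 0.83, 0.81, 0.84, 0.78, 0.82, 0.80]
tool_b = [0.64, 0.70, 0.66, 0.61, 0.69, 0.65, 0.68, 0.63, 0.67, 0.62]

print(a12(tool_a, tool_b))
```

An A12 value near 1 would indicate that the first tool almost always scores higher on that class, a value near 0 the opposite, and a value near 0.5 no practical difference.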

Paper Structure (25 sections, 7 figures)

Figures (7)

  • Figure 1: Overview of the pipeline
  • Figure 2: Default prompt used for LLM-based test generation
  • Figure 3: Execution-metrics comparisons of the different LLMs in TestSpark
  • Figure 4: Comparison of different automatic test generation tools
  • Figure 5: Venn diagrams of tool performance across CUTs, computed from pairwise tool comparisons using $p$-value and effect size
  • ...and 2 more figures