Unit Test Generation using Generative AI : A Comparative Performance Analysis of Autogeneration Tools

Shreya Bhatia, Tarushi Gandhi, Dhruv Kumar, Pankaj Jalote

TL;DR

This paper compares two approaches to automated unit-test generation for Python: an LLM-based approach (ChatGPT) and a traditional search-based software testing (SBST) tool (Pynguin). The comparison covers procedural, function-based, and class-based code. The authors assemble a dataset of 109 core modules from 60+49 projects, categorize the code by structure, and evaluate test quality using statement coverage, branch coverage, and the correctness of generated assertions. They also explore iterative prompting to improve ChatGPT's coverage. ChatGPT achieves coverage comparable to Pynguin across all three categories; iterative prompting significantly boosts coverage for well-structured code, with saturation reached after around four iterations. The two tools show limited overlap in the statements they miss, which suggests that combining them could yield higher coverage. A substantial portion of ChatGPT's assertions can be incorrect, which underscores the need for semantically grounded assertion generation and for broader generalization across languages.
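The iterative-prompting loop described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: `measure_coverage` is a hypothetical callback standing in for "re-prompt the model, run the generated tests, report statement coverage", the `min_gain` stopping threshold is an assumption, and the coverage numbers are made up for the demo rather than taken from the paper.

```python
from typing import Callable, List


def iterate_until_saturation(
    measure_coverage: Callable[[int], float],
    max_iterations: int = 10,
    min_gain: float = 0.5,
) -> List[float]:
    """Re-prompt until coverage improves by less than min_gain points.

    measure_coverage(i) is a hypothetical callback: it should issue the
    prompt for iteration i, run the resulting tests, and return statement
    coverage as a percentage.
    """
    history = [measure_coverage(0)]  # iteration 0 = basic prompt
    for i in range(1, max_iterations + 1):
        current = measure_coverage(i)
        history.append(current)
        # Stop once an extra round no longer pays for itself.
        if current - history[-2] < min_gain:
            break
    return history


# Toy stand-in for the prompt-and-measure step (illustrative values only).
fake_coverage = [62.0, 75.0, 83.0, 88.0, 88.2, 88.2]
history = iterate_until_saturation(
    lambda i: fake_coverage[min(i, len(fake_coverage) - 1)]
)
print(history)
```

With these toy values the loop stops after the fourth improvement round, because the gain from 88.0 to 88.2 falls below the assumed threshold.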

Abstract

Generating unit tests is a crucial task in software development, demanding substantial time and effort from programmers. The advent of Large Language Models (LLMs) introduces a novel avenue for unit test script generation. This research aims to experimentally investigate the effectiveness of LLMs, exemplified by ChatGPT, for generating unit test scripts for Python programs, and how the generated test cases compare with those generated by an existing unit test generator (Pynguin). For experiments, we consider three types of code units: 1) Procedural scripts, 2) Function-based modular code, and 3) Class-based code. The generated test cases are evaluated based on criteria such as coverage, correctness, and readability. Our results show that ChatGPT's performance is comparable with Pynguin's in terms of coverage, and in some cases superior. We also find that about a third of the assertions generated by ChatGPT for some categories were incorrect. Our results also show that there is minimal overlap in missed statements between ChatGPT and Pynguin, suggesting that a combination of both tools may enhance unit test generation performance. Finally, in our experiments, prompt engineering improved ChatGPT's performance, achieving much higher coverage.
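To make the "minimal overlap in missed statements" argument concrete, the sketch below shows how combined coverage follows from the sets of line numbers each tool leaves uncovered. It is an assumption-laden illustration: the function name, the Jaccard-style overlap measure, and the line numbers are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Set


def combined_statement_coverage(
    total_statements: int,
    missed_a: Set[int],
    missed_b: Set[int],
) -> Dict[str, float]:
    """Compare two test suites by the statements each one misses.

    A statement stays uncovered by the combined suite only if BOTH
    suites miss it, so small overlap means large combined coverage.
    """
    missed_by_both = missed_a & missed_b
    missed_by_either = missed_a | missed_b

    def coverage(missed: Set[int]) -> float:
        return 100.0 * (total_statements - len(missed)) / total_statements

    return {
        "coverage_a": coverage(missed_a),
        "coverage_b": coverage(missed_b),
        "combined": coverage(missed_by_both),
        # Jaccard overlap of the two missed sets (0 = disjoint, 1 = identical).
        "overlap": len(missed_by_both) / len(missed_by_either)
        if missed_by_either
        else 0.0,
    }


# Hypothetical module with 20 statements; line numbers are illustrative.
missed_chatgpt = {3, 7, 8, 15}
missed_pynguin = {8, 11, 12}
result = combined_statement_coverage(20, missed_chatgpt, missed_pynguin)
print(result)
```

In this toy case the two suites reach 80% and 85% on their own, but only line 8 is missed by both, so running them together covers 95% of the statements.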

Paper Structure

This paper contains 17 sections, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Basic Prompt
  • Figure 2: Improvement Prompt
  • Figure 3: Workflow of our Empirical Analysis
  • Figure 4: Statement coverage (left) and Branch coverage (right) obtained by ChatGPT (blue) and Pynguin (red) for all code samples (100-300 LOC).
  • Figure 5: Statement coverage obtained by ChatGPT (blue) and Pynguin (red) for all code samples at different McCabe complexities.
  • ...and 2 more figures