Automatic High-Level Test Case Generation using Large Language Models

Navid Bin Hasan, Md. Ashraful Islam, Junaed Younus Khan, Sanjida Senjik, Anindya Iqbal

TL;DR

The paper addresses the misalignment between business requirements and software testing by proposing automatic generation of high-level test cases from use cases. It builds a dataset of 1,067 use case–test case pairs (including real-world and student projects) and evaluates both pre-trained LLMs (e.g., GPT-4o, Gemini) and fine-tuned smaller LLMs (LLaMA 3.1 8B, Mistral 7B) on this task. One-shot prompting with GPT-4o yields higher semantic alignment (BERTScore) than Gemini, while the fine-tuned open-source models achieve comparable or superior precision, recall, and F1, enabling privacy-preserving in-house deployment. Human evaluators rate the generated test cases as readable, usable, and generally correct, though completeness and relevance vary by model. The work also discusses augmented context and retrieval strategies, threats to validity, and avenues for future research.

Abstract

We explored the challenges practitioners face in software testing and proposed automated solutions to address these obstacles. We began with a survey of local software companies and 26 practitioners, revealing that the primary challenge is not writing test scripts but aligning testing efforts with business requirements. Based on these insights, we constructed a use-case $\rightarrow$ (high-level) test-cases dataset to train/fine-tune models for generating high-level test cases. High-level test cases specify what aspects of the software's functionality need to be tested, along with the expected outcomes. We evaluated large language models such as GPT-4o, Gemini, LLaMA 3.1 8B, and Mistral 7B; fine-tuning the latter two improved their performance. A final human-evaluation survey confirmed the effectiveness of the generated test cases. Our proactive approach strengthens requirement-testing alignment and facilitates early test case generation to streamline development.

Paper Structure

This paper contains 35 sections, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Use case to Test cases Generation.
  • Figure 2: The four major phases of this study.
  • Figure 3: Prompt for generating test cases from a use case using pre-trained LLMs.