Enhancing Large Language Models for Text-to-Testcase Generation

Saranya Alagarsamy; Chakkrit Tantithamthavorn; Wannita Takerngsaksiri; Chetan Arora; Aldeida Aleti

TL;DR

The paper addresses the challenge of generating test cases from textual requirements for Test-Driven Development (TDD) without relying on code input. It proposes a Text-to-Testcase approach that fine-tunes GPT-3.5-turbo on a large, curated dataset of method descriptions and test cases, coupled with an effective prompting design. Across five open-source Java projects, the method yields 7k test cases with strong metrics: 78.5% syntax correctness, 67.09% requirement alignment, and 61.7% code coverage, significantly outperforming eight baseline LLMs. An ablation study confirms substantial gains from both fine-tuning and prompting, and a user study suggests practical usability, supporting the viability of LLM-based test generation for TDD in real-world software engineering contexts.

Abstract

Context: Test-driven development (TDD) is a widely employed software development practice in which test cases are developed from requirements before the code is written. Although various methods for automated test case generation have been proposed, they are not specifically tailored for TDD, where requirements rather than code serve as input. Objective: In this paper, we introduce a text-to-testcase generation approach based on a large language model (GPT-3.5) that is fine-tuned on our curated dataset with an effective prompt design. Method: Our approach enhances the capabilities of basic GPT-3.5 for the text-to-testcase generation task by fine-tuning it on our curated dataset and applying an effective prompt design. We evaluated the effectiveness of our approach on five large-scale open-source software projects. Results: Our approach generated 7k test cases for open-source projects, achieving 78.5% syntactic correctness, 67.09% requirement alignment, and 61.7% code coverage, which substantially outperforms all other LLMs (basic GPT-3.5, Bloom, and CodeT5). In addition, our ablation study demonstrates that both the fine-tuning and prompting components of the GPT-3.5 model contribute substantial performance improvements. Conclusions: These findings lead us to conclude that fine-tuning and prompting should be considered in the future when building a language model for the text-to-testcase generation task.

Paper Structure (27 sections, 6 figures, 5 tables)

Introduction
Background and Related Work
Test Driven Development (TDD)
Large Language Models (LLMs)
Related Work
Automated Test Case Generation
LLMs for Test Case Generation
Effectiveness of Test Case
Experimental Design
Research Questions
Evaluation Dataset
Baseline Comparisons
Hyper-Parameter Settings for Fine-Tuning
Evaluation Measures
Enhancing Large Language Models for Text-To-Testcase Generation
...and 12 more sections

Figures (6)

Figure 1: An overview of our approach (i.e., the prompting-based fine-tuned GPT-3.5-turbo model for the Text-To-Testcase generation task).
Figure 2: Structure of our Improved prompt and Basic prompt.
Figure 3: Model input, the generated test case, and the ground truth for evaluation.
Figure 4: (RQ2) The experimental results of the ablation study with the GPT-3.5 model.
Figure 5: (RQ4) The common types of errors that occur in the generated test cases.
...and 1 more figures
