TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Wenhan Wang; Chenyuan Yang; Zhijie Wang; Yuheng Huang; Zhaoyang Chu; Da Song; Lingming Zhang; An Ran Chen; Lei Ma

TL;DR

TESTEVAL is a benchmark for evaluating LLM-driven test case generation on Python programs. It uses 210 LeetCode solutions and defines three tasks (overall coverage, targeted line/branch coverage, and targeted path coverage) to probe how well LLMs reason about program execution. Across the sixteen LLMs evaluated, models achieve strong overall coverage and produce executable test cases, but substantial gaps remain when they must target specific statements, branches, or execution paths, which points to a need for better program-logic understanding and prompting strategies. The dataset and benchmark pipelines are open-sourced to support future research on LLM-based software testing, including advanced reasoning frameworks and cost-effective prompting methods.

Abstract

Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program under test. Recently, researchers have recognized the potential of large language models (LLMs) in software testing. However, there remains a lack of fair comparisons between different LLMs in terms of test case generation capabilities. In this paper, we propose TESTEVAL, a novel benchmark for test case generation with LLMs. We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage. We further evaluate sixteen popular LLMs, including both commercial and open-source ones, on TESTEVAL. We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs, indicating a lack of ability to comprehend program logic and execution paths. We have open-sourced our dataset and benchmark pipelines at https://github.com/LLM4SoftwareTesting/TestEval.
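To make the difference between overall coverage and targeted line coverage concrete, the following is a minimal Python sketch built on the standard library's `sys.settrace`. It is not the TESTEVAL pipeline: the program under test, the helper `executed_lines`, and the line offsets are hypothetical and chosen only for illustration, and the benchmark's actual measurement tooling is not described in this excerpt.

```python
import sys

# Hypothetical stand-in for a LeetCode-style program under test.
# Offsets are counted from the `def` line (offset 0).
def program_under_test(nums):
    total = 0                # offset 1
    for n in nums:           # offset 2
        if n < 0:            # offset 3
            total -= n       # offset 4  <- target line
        else:
            total += n       # offset 6
    return total             # offset 7

EXECUTABLE = {1, 2, 3, 4, 6, 7}  # assumed executable line offsets
TARGET = 4                       # assumed target line offset

def executed_lines(func, *args):
    """Run func(*args) and return the set of line offsets it executed."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Record only 'line' events that belong to the function under test.
        if frame.f_code is code and event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return hit

def overall_coverage(hit):
    """Fraction of the assumed executable lines that were executed."""
    return len(hit & EXECUTABLE) / len(EXECUTABLE)

# A test input with only non-negative numbers reaches most lines
# but never executes the target line.
hit_a = executed_lines(program_under_test, [3, 5])
# A test input containing a negative number reaches the target line.
hit_b = executed_lines(program_under_test, [-3])

print(overall_coverage(hit_a), TARGET in hit_a)
print(overall_coverage(hit_a | hit_b), TARGET in hit_b)
```

The first input scores well on overall coverage yet misses the target line, which mirrors the gap the abstract reports: reaching a specific line requires reasoning about which input drives execution down a particular branch.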

Paper Structure (21 sections, 1 equation, 6 figures, 6 tables, 2 algorithms)

Introduction
Approach
Task Description
Benchmark Dataset
Evaluation
Experiment Setup
Overall Coverage
Targeted Line and Branch Coverage
Targeted Path Coverage
Advanced Prompting
Related Work
Conclusion
Prompt Templates
Prompt Template for Overall Coverage
Prompt Template for Targeted Line Coverage
...and 6 more sections

Figures (6)

Figure 1: The pipeline for running and evaluating LLMs for test case generation on TESTEVAL.
Figure 2: An example for selecting targeted lines and branches from programs under test.
Figure 3: A motivating example showing the importance of path coverage (left), and examples of execution paths extracted from this program (right).
Figure 4: The input constraints for a LeetCode problem (left) and its random input generator (right).
Figure 5: Example of a generated test case that failed to cover the target line. (a): the program under test. (b): LLM-generated reasoning steps. (c): LLM-generated test cases based on reasoning steps.
...and 1 more figures
