Large Language Models as Test Case Generators: Performance Evaluation and Enhancement

Kefan Li, Yuan Yuan

TL;DR

This work investigates whether Large Language Models can reliably generate high-quality test cases for code. It reveals that performance deteriorates on harder problems due to challenges in computation and reasoning, motivating a modular approach. The proposed TestChain framework decouples input generation (Designer agent) from output mapping (Calculator agent) and uses a Python interpreter with a ReAct-style loop to improve accuracy, achieving substantial gains over baselines (e.g., GPT-4 with TestChain reaches 71.79% accuracy on LeetCode-no-exp, with a 13.84 percentage-point improvement on LeetCode-hard). The findings highlight the practical value of tool-assisted, multi-agent test-case generation to enhance software reliability and guide future research in automated testing with LLMs.

Abstract

Code generation with Large Language Models (LLMs) has been extensively studied and has achieved remarkable progress. As a complement to code generation, test case generation is crucial for ensuring the quality and reliability of code. However, using LLMs as test case generators has been much less explored. Current research along this line primarily focuses on enhancing code generation with assistance from test cases generated by LLMs, while the performance of LLMs in test case generation alone has not been comprehensively examined. To bridge this gap, we conduct extensive experiments to study how well LLMs can generate high-quality test cases. We find that as problem difficulty increases, state-of-the-art LLMs struggle to generate correct test cases, largely due to their inherent limitations in computation and reasoning. To mitigate this issue, we further propose a multi-agent framework called TestChain that decouples the generation of test inputs and test outputs. Notably, TestChain uses a ReAct-format conversation chain in which LLMs interact with a Python interpreter to produce more accurate test outputs. Our results indicate that TestChain outperforms the baseline by a large margin. In particular, in terms of test case accuracy, TestChain with GPT-4 as the backbone achieves a 13.84% improvement over the baseline on the LeetCode-hard dataset.
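The decoupling described above can be illustrated with a minimal sketch. It is not the authors' implementation: `designer_stub` and `calculator_stub` are hypothetical stand-ins for LLM calls, the sorting problem and the name `sort_list` are invented for illustration, and the multi-turn ReAct conversation is reduced to a single "write code, observe interpreter output" step. The point it shows is that the expected output comes from executing code rather than from the model computing it directly.

```python
import contextlib
import io


def run_python(snippet: str) -> str:
    """Execute a snippet and return what it prints (the 'observation')."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(snippet, {})  # fresh namespace per call; NOT sandboxed
    return buffer.getvalue().strip()


def designer_stub(problem: str) -> list[str]:
    """Hypothetical stand-in for an LLM that proposes test inputs only."""
    return ["[3, 1, 2]", "[]", "[5, 5]"]


def calculator_stub(problem: str, test_input: str) -> str:
    """Hypothetical stand-in for an LLM that writes code computing the output."""
    return f"print(sorted({test_input}))"


def build_test_cases(problem: str, func_name: str) -> list[str]:
    """Pair each designed input with an interpreter-computed output."""
    cases = []
    for test_input in designer_stub(problem):
        snippet = calculator_stub(problem, test_input)
        observed = run_python(snippet)  # output comes from the interpreter
        cases.append(f"assert {func_name}({test_input}) == {observed}")
    return cases


tests = build_test_cases("Sort a list of integers in ascending order.", "sort_list")
for line in tests:
    print(line)
```

In a real setting the executed snippets would come from a model, so the interpreter call would need proper isolation; the bare `exec` here is only suitable for trusted toy input.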

Paper Structure (9 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: Formulation of test case generation in this paper, where an assertion encapsulates a test case.
  • Figure 2: Statistics on the number of test cases successfully generated. The gray line indicates the maximum number of test cases allowed by the dataset.
  • Figure 3: Statistics on the number of incorrect test cases for each type of error.
  • Figure 4: Illustration of the TestChain framework.
  • Figure 5: Example of the conversation process produced by the Calculator agent.
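
As a rough illustration of the formulation in Figure 1, where an assertion encapsulates a test case, the sketch below checks generated assertions against a trusted reference solution and reports the fraction that pass. The toy problem, the names `reference_solution`, `is_correct`, and `sort_list`, and the three sample assertions are all assumptions made for this example; the paper's exact accuracy metric and error categories are not reproduced in this section, so this is one plausible reading rather than the authors' evaluation code.

```python
def reference_solution(nums):
    """Trusted reference implementation for the toy problem (an assumption)."""
    return sorted(nums)


def is_correct(test_case: str, func_name: str, func) -> bool:
    """Return True when the assertion runs and passes against the reference."""
    try:
        exec(test_case, {func_name: func})
        return True
    except Exception:  # covers AssertionError, SyntaxError, runtime errors
        return False


# Hypothetical generated test cases, one assertion per test case.
generated = [
    "assert sort_list([3, 1, 2]) == [1, 2, 3]",  # correct
    "assert sort_list([2, 2, 1]) == [2, 1, 2]",  # wrong expected output
    "assert sort_list([1, 2) == [1, 2]",         # malformed assertion
]

results = [is_correct(t, "sort_list", reference_solution) for t in generated]
accuracy = sum(results) / len(generated)
print(results, accuracy)
```

Distinguishing the failure modes (for example, catching `AssertionError` separately from `SyntaxError`) would be a natural extension for a per-error-type breakdown.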