ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms

Yibo Wang, Congying Xia, Wenting Zhao, Jiangshu Du, Chunyu Miao, Zhongfen Deng, Philip S. Yu, Chen Xing

TL;DR

ProjectTest establishes the first project-level benchmark for unit test generation across Python, Java, and JavaScript, using 20 curated projects per language to push frontier LLMs beyond function-level evaluation. It introduces a three-scenario pipeline (vanilla generation, manual error fixing, and LLM self-fixing) and analyzes compilation, correctness, and coverage while detailing per-language error patterns. Most frontier LLMs show moderate performance on Python and Java, with Java being the hardest. Manual error fixing significantly boosts results, especially for Java, while self-fixing offers more limited gains that depend heavily on model type and context length. The work highlights the significant impact of compilation and cascade errors on test quality and demonstrates the potential of error-fixing pipelines to improve project-level unit test generation. Code and data are publicly available for reproducibility and further research.

Abstract

Unit test generation has become a promising and important use case of LLMs. However, existing evaluation benchmarks for assessing LLM unit test generation capabilities focus on function- or class-level code rather than more practical and challenging project-level codebases. To address this limitation, we propose ProjectTest, a project-level benchmark for unit test generation covering Python, Java, and JavaScript. ProjectTest features 20 moderate-sized and high-quality projects per language. We evaluate nine frontier LLMs on ProjectTest, and the results show that all frontier LLMs tested exhibit moderate performance on ProjectTest for Python and Java, highlighting the difficulty of ProjectTest. We also conduct a thorough error analysis, which shows that even frontier LLMs, such as Claude-3.5-Sonnet, make significant basic yet critical errors, including compilation and cascade errors. Motivated by this observation, we further evaluate all frontier LLMs under manual error-fixing and self-error-fixing scenarios to assess their potential when equipped with error-fixing mechanisms. Our code and dataset are available at https://github.com/YiboWANG214/ProjectTest.

Paper Structure

This paper contains 31 sections, 12 figures, and 12 tables.

Figures (12)

  • Figure 1: Overview of the unit test generation process.
  • Figure 2: An example of ProjectTest.
  • Figure 3: The prompt used to generate unit tests for Python projects. Purple indicates language-specific instructions. Blue, orange, and red indicate instructions related to compilation rate, correctness rate, and coverage rate, respectively.
  • Figure 4: An example of compilation error generated by GPT-4-Turbo.
  • Figure 5: An example of cascade error generated by CodeQwen1.5-7B-Chat.
  • ...and 7 more figures