CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification

Jiacheng Xu, Bo Pang, Jin Qu, Hiroaki Hayashi, Caiming Xiong, Yingbo Zhou

TL;DR

CLOVER introduces a long-context, coverage-guided benchmark for test-case generation and verification, targeting three progressive tasks that span cloze-style assertion completion, targeted test implementation, and coverage-oriented test generation. By leveraging a coverage-driven oracle and a Dockerized sandbox, CLOVER evaluates 14 models across $4k$–$128k$ token contexts on 12 Python repositories (845 problems), exposing substantial gaps between open-source and proprietary models, especially under heavy long-context demands. The key contributions are a scalable data/sandbox pipeline, an oracle retrieval mechanism calibrated by file-importance, and detailed evaluation of execution versus coverage success across tasks, revealing that even top-tier models struggle with Task III. The work underscores the potential for long-context, coverage-aware evaluation to drive model improvements and provides resources for community use and future research in automated test-generation and software engineering with LLMs.

Abstract

Software testing is a critical aspect of software development, yet generating test cases remains a routine task for engineers. This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases under specific conditions. The tasks span from simple assertion completions to writing test cases that cover specific code blocks across multiple files. They are based on 12 Python repositories and comprise 845 problems, with context lengths ranging from 4k to 128k tokens. Utilizing code testing frameworks, we propose a method to construct retrieval contexts using coverage information. While models exhibit comparable performance with short contexts, notable differences emerge with 16k contexts. Models like GPT-4o and Claude 3.5 can effectively leverage relevant snippets; however, all models score below 35\% on the complex Task III, even with the oracle context provided, underscoring the benchmark's significance and the potential for model improvement. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
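To make the cloze-style assertion completion of Task I concrete, the following is a minimal sketch of how an assertion in a test function could be masked. It is an illustration only, not the benchmark's actual construction code: the `<MASK>` placeholder, the helper name `mask_last_assertion`, and the choice to mask the last single-line assertion are all assumptions made for this example.

```python
import ast

MASK = "<MASK>"  # hypothetical placeholder token, not taken from the paper


def mask_last_assertion(test_source: str) -> tuple[str, str]:
    """Hide the expression of the last assert; return (masked_source, answer)."""
    tree = ast.parse(test_source)
    asserts = [node for node in ast.walk(tree) if isinstance(node, ast.Assert)]
    if not asserts:
        raise ValueError("no assert statement found")

    # Pick the assert that appears last in the file.
    target = max(asserts, key=lambda node: node.lineno)
    if target.lineno != target.end_lineno:
        raise ValueError("multi-line asserts are not handled in this sketch")

    lines = test_source.splitlines()
    line = lines[target.lineno - 1]

    # The asserted expression is the ground truth a model would have to recover.
    answer = ast.get_source_segment(test_source, target.test)

    # Keep everything before the assert keyword (normally just indentation).
    prefix = line[: target.col_offset]
    lines[target.lineno - 1] = f"{prefix}assert {MASK}"
    return "\n".join(lines), answer


# Hypothetical test function, loosely named after the test_iter example.
example = '''def test_iter():
    tokens = [1, 2, 3]
    assert len(tokens) == 3
    assert list(iter(tokens)) == [1, 2, 3]
'''

masked, answer = mask_last_assertion(example)
print(masked)
print("expected completion:", answer)
```

In a setup like this, a model's completion would be substituted for the placeholder and the test executed, so correctness is judged by execution rather than by string match against `answer`.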

Paper Structure

This paper contains 20 sections, 3 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Pipeline overview. In this example, we focus on a test function test_iter, which covers the use of the Token and TokenStream classes from the source code. There are four major steps: (1) extract the problem from a test file, test_lexnparse.py; (2) verify the extracted case(s) by running pytest; (3) assemble task prompts with pre-constructed oracle dependent files; (4) obtain the model response and verify the execution status. In Task I, we mask part of the assertion statements. In Tasks II and III, we ask the model to complete the test code almost from scratch with constraints imposed.
  • Figure 2: Task-specific prompt template for Tasks I, II, and III. For the complete prompts, see the prompt appendix.
  • Figure 3: PO prompt of Task I.
  • Figure 4: Contextual prompt of Task I.
  • Figure 5: Contextual $prompt_{full}$ of Task II.
  • ...and 1 more figure