How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective

Xianzhen Luo, Jinyang Huang, Wenzhen Zheng, Qingfu Zhu, Mingzheng Xu, Yiheng Xu, Yuantao Fan, Libo Qin, Wanxiang Che

TL;DR

The paper addresses the challenge of evaluating algorithmic test-case generation for LLMs, where golden test cases are costly and naive evaluations suffer from score inflation. It introduces a binary code-test matrix framework in which $M\in\{0,1\}^{n\times d}$ encodes whether each of $n$ wrong codes fails each of $d$ golden tests. The matrix rank $\mathrm{rank}(M)$ gives the number of independent error patterns and an upper bound on the number of test cases needed to distinguish them. To compute a principled, diverse diagnostic basis, it proposes WrongSelect, an approximation algorithm that minimizes the average pairwise Jaccard similarity among basis rows (and so maximizes their diversity), built on pre-filtering and random-restart local search. The authors construct TC-Bench by curating data from major contests, applying the rank-based framework, and selecting a compact, diverse set of 9,347 core wrong codes across 877 problems. They evaluate 13 LLMs and report a substantial gap in current test-case generation methods, score inflation in unfiltered benchmarks, and the practical value of a rank-based, inflation-resistant benchmark for guiding future improvements.

Abstract

Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. In this work, we ask two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and provides a tight upper bound on the number of test cases required for complete fault coverage. Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity. To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm to select maximally diverse wrong codes. Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power. Our dataset is available at: https://huggingface.co/datasets/Luoberta/TC-Bench and our code is at: https://github.com/Luowaterbi/TC-Bench.
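As a small illustration of the rank idea in the abstract, the sketch below computes the rank of a toy code-test matrix. It is not the authors' implementation. The text here does not say which field the rank is taken over, so this sketch assumes GF(2) (rows combined by XOR); the function name `gf2_rank` and the toy matrix are hypothetical. Rows are wrong codes, columns are golden tests, and an entry of 1 means the wrong code fails that test.

```python
from typing import List


def gf2_rank(matrix: List[List[int]]) -> int:
    """Rank of a 0/1 matrix over GF(2) (an assumption; see lead-in)."""
    basis = {}  # leading-bit position -> reduced row with that leading bit
    for row in matrix:
        # Pack the row into an integer bitmask so XOR acts on the whole row.
        v = int("".join(str(bit) for bit in row), 2) if row else 0
        while v:
            top = v.bit_length() - 1
            if top not in basis:
                basis[top] = v  # row is independent of those seen so far
                break
            v ^= basis[top]     # eliminate the leading bit and keep reducing
    return len(basis)


# Toy example: 4 wrong codes x 3 golden tests.
toy = [
    [1, 0, 0],  # fails test 0 only
    [0, 1, 0],  # fails test 1 only
    [1, 1, 0],  # combination of the two rows above
    [1, 0, 0],  # duplicate failure pattern
]
rank = gf2_rank(toy)
```

In this toy case only two failure patterns are independent, so a basis of two wrong codes would represent all four rows. The example was chosen so that the rank is the same over GF(2) and over the reals; in general the two can differ.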

Paper Structure

This paper contains 47 sections, 3 equations, 13 figures, 2 tables, 1 algorithm.

Figures (13)

  • Figure 1: A comparison of two evaluation frameworks for Augment Test cases (ATs). Both frameworks start from the same raw data (a), which consists of many wrong codes (WCs) and their execution results on Golden Test cases (GTs). (b) The naive evaluation utilizes the full set of WCs and an unprincipled number of ATs, suffers from prohibitive computational costs, and leads to inflated scores. (c) In contrast, our proposed framework first processes this data with WrongSelect to select a compact yet representative diagnostic basis (TC-Bench). Evaluation using this basis is not only highly efficient but also yields more valid scores.
  • Figure 2: An overview of the TC-Bench construction pipeline. It begins with raw data collection, followed by a two-step WrongSelect working on the transformed binary matrix $M$. Step 1 pre-filters the problems with an all-"1" column and removes codes whose rows have too many "1"s. Step 2 samples numerous initial bases $\mathcal{I}_{current}$ from the filtered $M'$ and iteratively minimizes the diversity score by swapping internal and external rows. The best local optimum is chosen to approximate the global optimum. Concurrently, problem descriptions are standardized and correct codes are sampled from the top 20% performers, ensuring the overall quality of TC-Bench.
  • Figure 3: CRUX, PRESUDO, and ALGO construct the output, while LCB and HT depend on the correct code to generate the output.
  • Figure 4: (a) A comparison of HackRate between all WCs (before filtering) and TC-Bench. (b) The normalized execution time of the correct code. (c) Across random samplings of 8 correct codes, both PassRate and HackRate remain stable.
  • Figure 5: Results of test case scaling for each model and method. The x-axis represents the number of test cases, scaled as multiples of the problem's rank from 1x to 5x.
  • ...and 8 more figures
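
The two-step procedure described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' WrongSelect code: the rank is taken over GF(2), the "diversity score" is read as average pairwise Jaccard similarity (lower means more diverse), a problem with an all-"1" column is dropped entirely, and the threshold `max_fail_ratio` and all function names are hypothetical.

```python
import random
from itertools import combinations


def to_mask(row):
    return int("".join(str(b) for b in row), 2)


def gf2_rank(masks):
    basis = {}
    for v in masks:
        while v:
            top = v.bit_length() - 1
            if top not in basis:
                basis[top] = v
                break
            v ^= basis[top]
    return len(basis)


def score(masks):
    """Average pairwise Jaccard similarity of failure sets (assumed score)."""
    pairs = list(combinations(masks, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = bin(a | b).count("1")
        total += bin(a & b).count("1") / union if union else 0.0
    return total / len(pairs)


def prefilter(matrix, max_fail_ratio=0.8):
    """Step 1: drop the problem or drop rows (threshold is hypothetical)."""
    if not matrix:
        return None
    d = len(matrix[0])
    if any(all(row[j] for row in matrix) for j in range(d)):
        return None  # some test fails every wrong code
    return [row for row in matrix if sum(row) <= max_fail_ratio * d]


def random_basis(masks, rng):
    """Pick rows in random order, keeping each one that raises the rank."""
    order = list(range(len(masks)))
    rng.shuffle(order)
    chosen = []
    for i in order:
        if gf2_rank([masks[k] for k in chosen] + [masks[i]]) > len(chosen):
            chosen.append(i)
    return chosen


def local_search(masks, basis):
    """Step 2: swap internal and external rows while the score decreases."""
    best = score([masks[i] for i in basis])
    improved = True
    while improved:
        improved = False
        for pos in range(len(basis)):
            for ext in range(len(masks)):
                if ext in basis:
                    continue
                cand = basis[:pos] + [ext] + basis[pos + 1:]
                cand_masks = [masks[i] for i in cand]
                if gf2_rank(cand_masks) < len(basis):
                    continue  # swap would break independence
                s = score(cand_masks)
                if s < best - 1e-12:
                    basis, best, improved = cand, s, True
    return basis, best


def wrong_select(matrix, restarts=20, seed=0):
    rows = prefilter(matrix)
    if not rows:
        return None
    masks = [to_mask(r) for r in rows]
    rng = random.Random(seed)
    best_basis, best_score = None, float("inf")
    for _ in range(restarts):
        basis, s = local_search(masks, random_basis(masks, rng))
        if s < best_score:
            best_basis, best_score = basis, s
    return [rows[i] for i in sorted(best_basis)], best_score


toy = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],  # fails every test; removed by the pre-filter
]
selected, best = wrong_select(toy)
```

On this toy matrix the filtered rows have rank 4, and the search settles on the four rows whose failure sets do not overlap at all, giving a score of 0. Each restart only reaches a local optimum, which is why the caption's "best local optimum" is taken across many restarts.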