Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tianxing He

TL;DR

TCGBench is a benchmark that evaluates LLMs on writing test case generators for competitive programming (CP) problems. It distinguishes valid generators from targeted ones that reveal faults in human-written code. By combining CP problem sources (NOIP and Canonical), multiple standard and erroneous solvers, and a targeted-instruction dataset, the study shows that LLMs can produce valid generators given enough samples but struggle to produce effective targeted generators, even with advanced reasoning models. A curated targeted-instruction dataset and LoRA-based fine-tuning improve targeted-generation performance, though gaps remain relative to human experts, and issues such as over-sensitivity and time-complexity misestimates persist. The work provides a framework for assessing LLM reasoning about code and offers practical insights for improving automated verification of generated code in competitive programming contexts.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation and can tackle complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that effectively reveal flaws in human code. In particular, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, through both prompting and fine-tuning.
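To make the two tasks concrete, the following is a minimal sketch on a hypothetical toy problem (print the maximum of n integers). The problem, its constraints, and every function name here are illustrative assumptions, not taken from the paper; how TCGBench itself judges generators is not specified in this abstract. A valid generator only has to respect the input constraints, while a targeted generator must also make a specific erroneous solution disagree with a standard one.

```python
import random

# Hypothetical toy problem (an assumption, not from the paper):
# given n integers with 1 <= n <= 100 and -1000 <= a_i <= 1000,
# print the maximum value.
N_MAX, A_MAX = 100, 1000


def standard_solver(a):
    """Reference solution for the toy problem."""
    return max(a)


def erroneous_solver(a):
    """Buggy solution: the running maximum starts at 0,
    so it is wrong whenever every element is negative."""
    best = 0
    for x in a:
        if x > best:
            best = x
    return best


def is_valid(a):
    """Check that an input respects the toy problem's constraints."""
    return 1 <= len(a) <= N_MAX and all(-A_MAX <= x <= A_MAX for x in a)


def valid_generator(rng):
    """Task 1 (sketch): produce any input that satisfies the constraints."""
    n = rng.randint(1, N_MAX)
    return [rng.randint(-A_MAX, A_MAX) for _ in range(n)]


def targeted_generator(rng):
    """Task 2 (sketch): produce a valid input aimed at the bug above
    by drawing only negative values."""
    n = rng.randint(1, N_MAX)
    return [rng.randint(-A_MAX, -1) for _ in range(n)]


def exposes_bug(a):
    """A test case exposes the bug if it is valid and the two solvers disagree."""
    return is_valid(a) and standard_solver(a) != erroneous_solver(a)
```

A random valid input will often contain a positive element and so miss this bug, which is why the targeted generator has to reason about the faulty code rather than only about the constraints.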

Paper Structure

This paper contains 36 sections, 4 equations, 22 figures, 4 tables.

Figures (22)

  • Figure 1: An example of our benchmark. It consists of two tasks: generation of valid test case generators and generation of targeted test case generators. To enhance clarity, the example has been presented in a simplified format. For the full example, please refer to Appendix \ref{sec:app_examples}.
  • Figure 2: Valid@k results. When $k\ge5$, o1 exceeds human-level performance. However, at $k=1$, most models still fall short of human level.
  • Figure 3: Success@k results. Most models perform poorly at success@1. However, as the number of samples increases, success@k improves significantly. This suggests that the models have a decent ability to generate targeted generators, but require multiple tries.
  • Figure 4: Success@k on the Targeted Instruction Dataset. With the help of the targeted instructions, all three models show a significant improvement in success@1.
  • Figure 5: Success@k in the Canonical Problem Dataset. The last two groups show the results before and after fine-tuning Qwen2.5-14B. It can be observed that the fine-tuned Qwen2.5-14B achieves capabilities close to GPT-4o-mini.
  • ...and 17 more figures