Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems
Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tianxing He
TL;DR
TCGBench is a benchmark for evaluating LLMs on writing test case generators for competitive programming (CP) problems. It distinguishes valid generators from targeted ones that expose faults in human-written code. Combining CP problem sources (NOIP and Canonical), multiple standard and erroneous solvers, and a targeted-instruction dataset, the study shows that LLMs can produce valid generators given enough samples but struggle to produce effective targeted generators, even when advanced reasoning models are used. A curated targeted-instruction dataset and LoRA-based fine-tuning improve targeted-generation performance, though a gap to human experts remains and issues such as over-sensitivity and time-complexity misestimates persist. The work provides a framework for assessing how LLMs reason about code and offers practical insights for improving automated verification of generated code in competitive programming contexts.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation and can tackle complex tasks during inference. However, the extent to which LLMs can be used for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. The benchmark comprises two tasks, which study the ability of LLMs to (1) generate valid test case generators for a given CP problem, and (2) generate targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most of them struggle to generate targeted test cases that effectively reveal flaws in human code. In particular, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance on the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that this dataset improves the performance of LLMs through both prompting and fine-tuning.
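
To make the two tasks concrete, the following is a minimal Python sketch on a toy problem. It is not taken from the benchmark: the problem, its constraints, the buggy solution, and every function name are hypothetical, and the sketch assumes that "valid" means the generated input satisfies the problem's constraints and that "targeted" means the input makes an erroneous solution disagree with a correct one.

```python
import random

# Hypothetical toy problem (not from TCGBench): given n integers with
# 1 <= n <= 100 and -1000 <= a_i <= 1000, print the maximum value.
N_MAX, A_MIN, A_MAX = 100, -1000, 1000


def is_valid(case):
    """Validator: check a generated input against the assumed constraints."""
    return 1 <= len(case) <= N_MAX and all(A_MIN <= a <= A_MAX for a in case)


def reference_solution(case):
    """Correct solution for the toy problem."""
    return max(case)


def buggy_solution(case):
    """Erroneous solution: the running maximum starts at 0, so the answer
    is wrong whenever every value in the input is negative."""
    best = 0
    for a in case:
        if a > best:
            best = a
    return best


def random_generator(rng):
    """Valid generator: any input within the constraints."""
    n = rng.randint(1, N_MAX)
    return [rng.randint(A_MIN, A_MAX) for _ in range(n)]


def targeted_generator(rng):
    """Targeted generator: only negative values, aimed at the bug above."""
    n = rng.randint(1, N_MAX)
    return [rng.randint(A_MIN, -1) for _ in range(n)]


def exposes_bug(case):
    """A case is a hit if it is valid and the two solutions disagree."""
    return is_valid(case) and reference_solution(case) != buggy_solution(case)


if __name__ == "__main__":
    rng = random.Random(0)
    trials = 1000
    random_hits = sum(exposes_bug(random_generator(rng)) for _ in range(trials))
    targeted_hits = sum(exposes_bug(targeted_generator(rng)) for _ in range(trials))
    print(f"random generator exposed the bug in {random_hits}/{trials} cases")
    print(f"targeted generator exposed the bug in {targeted_hits}/{trials} cases")
```

In this sketch the random generator is valid but reaches the failing condition only when every sampled value happens to be negative, whereas the targeted generator encodes an understanding of why the erroneous code fails. That difference is the gap between the two tasks described above.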
