HarnessLLM: Automatic Testing Harness Generation via Reinforcement Learning
Yujian Liu, Jiabao Ji, Yang Zhang, Wenbo Guo, Tommi Jaakkola, Shiyu Chang
TL;DR
HarnessLLM addresses the limitations of input-output-based testing by training LLMs to generate executable test harnesses that synthesize inputs and validate outputs. It uses a two-stage approach, supervised fine-tuning followed by reinforcement learning with a verifiable outcome reward, to produce harnesses that reveal bugs: the ground-truth program passes the harness while the buggy program fails. Across MBPP+, LiveCodeBench, and Codeforces, HarnessLLM shows stronger bug finding, greater diversity of testing strategies, and better generalization to unseen models, and it enables test-time scaling that improves code generation. This execution-based testing paradigm offers a practical way to make AI-assisted programming and debugging workflows more reliable.
Abstract
Existing LLM-based automatic test generation methods mainly produce pairs of inputs and expected outputs to characterize the intended behavior of correct programs. Although straightforward, these methods yield tests with limited diversity and provide little debugging information. We propose HarnessLLM, a two-stage training pipeline that enables LLMs to write harness code for testing. Specifically, the LLM generates code that synthesizes inputs and validates the observed outputs, which allows complex test cases and flexible output validation such as invariant checking. To achieve this, we train LLMs with supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR), using a customized reward design. Experiments show that HarnessLLM outperforms input-output-based testing in both bug finding and testing strategy diversity. HarnessLLM further improves code generation performance through test-time scaling, where our generated test cases serve as inference-phase validation. Our code is available at https://github.com/UCSB-NLP-Chang/HarnessLLM.git.
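To make the idea of a test harness concrete, the following is a minimal sketch and is not taken from the HarnessLLM repository. It assumes a toy task (return the distinct elements of a list in ascending order), and every name in it (`generate_inputs`, `check_output`, `run_harness`, `outcome_reward`, and the two example programs) is hypothetical. The harness synthesizes inputs and validates outputs through invariants rather than hard-coded expected values. The reward shown is only one plausible reading of a verifiable outcome reward (the ground-truth program passes and the buggy program fails); the exact reward design used in the paper is not specified in this abstract.

```python
import random
from typing import Callable, List

Program = Callable[[List[int]], List[int]]


def generate_inputs(n_random: int = 200, seed: int = 0) -> List[List[int]]:
    """Synthesize inputs: a few hand-picked edge cases plus random lists."""
    rng = random.Random(seed)
    cases: List[List[int]] = [[], [5], [2, 2, 2]]
    for _ in range(n_random):
        size = rng.randint(0, 20)
        cases.append([rng.randint(-50, 50) for _ in range(size)])
    return cases


def check_output(inp: List[int], out: List[int]) -> None:
    """Validate an output via invariants instead of a stored expected value."""
    # Invariant 1: strictly increasing, i.e. sorted with no duplicates.
    assert all(a < b for a, b in zip(out, out[1:])), "not strictly increasing"
    # Invariant 2: exactly the same set of values as the input.
    assert set(out) == set(inp), "value set differs from input"


def run_harness(program: Program) -> bool:
    """Return True if the program satisfies every check on every input."""
    for inp in generate_inputs():
        try:
            out = program(list(inp))  # pass a copy so the input is not mutated
            check_output(inp, out)
        except Exception:
            # A failed invariant or a crash both count as a harness failure.
            return False
    return True


def outcome_reward(ground_truth: Program, buggy: Program) -> float:
    """Assumed binary reward: 1.0 only if the harness separates the programs."""
    return 1.0 if run_harness(ground_truth) and not run_harness(buggy) else 0.0


def reference_solution(xs: List[int]) -> List[int]:
    """Hypothetical ground-truth program for the toy task."""
    return sorted(set(xs))


def buggy_solution(xs: List[int]) -> List[int]:
    """Hypothetical buggy program: sorts but forgets to remove duplicates."""
    return sorted(xs)
```

In this sketch, `outcome_reward(reference_solution, buggy_solution)` evaluates to 1.0 because the duplicate-heavy input `[2, 2, 2]` violates the strictly-increasing invariant for the buggy program, while the reference program satisfies both invariants on every generated input. A single input-output pair would catch the same bug only if its input happened to contain duplicates, which is why a harness that generates many inputs can cover more failure modes.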
