HarnessLLM: Automatic Testing Harness Generation via Reinforcement Learning

Yujian Liu, Jiabao Ji, Yang Zhang, Wenbo Guo, Tommi Jaakkola, Shiyu Chang

TL;DR

HarnessLLM tackles the limitations of input-output-based testing by training LLMs to generate executable test harnesses that synthesize inputs and validate outputs. It uses a two-stage approach (supervised fine-tuning followed by reinforcement learning with a verifiable outcome reward) to produce harnesses that expose bugs: the ground-truth program passes the harness while the buggy program fails it. Across MBPP+, LiveCodeBench, and Codeforces, HarnessLLM demonstrates superior bug finding, greater diversity of testing strategies, and better generalization to unseen models, while enabling test-time scaling to improve code generation. This execution-based testing paradigm offers a practical boost to the reliability of AI-assisted programming and debugging workflows.
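The outcome condition described above (the ground-truth program passes the harness, the buggy one fails) can be sketched as a binary reward. This is only an illustration under stated assumptions, not the paper's actual reward design: the names `passes`, `outcome_reward`, and the toy programs are hypothetical, and programs are modeled as plain Python callables rather than sandboxed executables.

```python
from typing import Callable

Program = Callable[[int], int]
Harness = Callable[[Program], None]  # assumed convention: raises on a detected bug


def passes(harness: Harness, program: Program) -> bool:
    """Return True if the harness runs to completion without raising."""
    try:
        harness(program)
        return True
    except Exception:
        # Any failed assertion or crash counts as "the harness flagged this program".
        return False


def outcome_reward(harness: Harness, reference: Program, buggy: Program) -> float:
    """Hypothetical binary reward: 1.0 only if reference passes AND buggy fails."""
    return 1.0 if passes(harness, reference) and not passes(harness, buggy) else 0.0


# Toy task (an assumption for illustration): absolute value.
def reference_abs(x: int) -> int:
    return x if x >= 0 else -x


def buggy_abs(x: int) -> int:
    return x  # bug: negative inputs are returned unchanged


def good_harness(program: Program) -> None:
    # Covers negative inputs and checks an invariant: the result is never negative.
    for x in range(-5, 6):
        assert program(x) >= 0


def weak_harness(program: Program) -> None:
    # Only non-negative inputs, so the bug is never triggered.
    for x in range(0, 6):
        assert program(x) == x


def invalid_harness(program: Program) -> None:
    # Wrong expectation: rejects even the correct program.
    assert program(-1) == -1
```

Under this sketch, `good_harness` earns the reward, `weak_harness` does not because it fails to expose the bug, and `invalid_harness` does not because it rejects the reference program.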

Abstract

Existing LLM-based automatic test generation methods mainly produce pairs of inputs and expected outputs to characterize the intended behavior of correct programs. Although straightforward, these methods yield tests of limited diversity and provide too little debugging information. We propose HarnessLLM, a two-stage training pipeline that enables LLMs to write harness code for testing. Specifically, the LLM generates code that synthesizes inputs and validates the observed outputs, which allows complex test cases and flexible output validation such as invariant checking. To achieve this, we train LLMs with supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR) under a customized reward design. Experiments show that HarnessLLM outperforms input-output-based testing in bug finding and testing strategy diversity. HarnessLLM further improves code generation performance through test-time scaling, with our generated test cases serving as inference-phase validation. Our code is available at https://github.com/UCSB-NLP-Chang/HarnessLLM.git.
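To make the idea of a harness concrete, the sketch below shows one possible shape for code that synthesizes inputs and validates outputs by invariant checking instead of comparing against hard-coded expected outputs. It is a minimal illustration and not the format HarnessLLM actually emits: the sorting task and the names `generate_inputs`, `check_output`, `harness_passes`, and `buggy_sort` are all assumptions made for this example.

```python
import random
from collections import Counter
from typing import Callable, List

SortFn = Callable[[List[int]], List[int]]


def generate_inputs(num_random: int = 50, seed: int = 0) -> List[List[int]]:
    """Synthesize inputs: a few hand-picked edge cases plus seeded random lists."""
    rng = random.Random(seed)
    cases: List[List[int]] = [[], [7], [3, 3, 3]]
    for _ in range(num_random):
        length = rng.randint(0, 20)
        cases.append([rng.randint(-100, 100) for _ in range(length)])
    return cases


def check_output(inp: List[int], out: List[int]) -> None:
    """Validate by invariants; no expected output is stored anywhere."""
    # Invariant 1: the output is in non-decreasing order.
    assert all(a <= b for a, b in zip(out, out[1:])), f"not sorted: {out}"
    # Invariant 2: the output is a permutation of the input.
    assert Counter(out) == Counter(inp), f"elements changed: {inp} -> {out}"


def harness_passes(program: SortFn) -> bool:
    """Run the program on every synthesized input; False on the first violation."""
    for inp in generate_inputs():
        try:
            check_output(inp, program(list(inp)))  # pass a copy to avoid aliasing
        except AssertionError:
            return False
    return True


def buggy_sort(xs: List[int]) -> List[int]:
    return sorted(set(xs))  # bug: silently drops duplicate elements
```

Here the built-in `sorted` passes the harness, while `buggy_sort` is rejected as soon as an input with duplicates (such as `[3, 3, 3]`) is checked. The assertion messages hint at why a harness can give more debugging information than a bare input-output mismatch.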

Paper Structure

This paper contains 39 sections, 1 equation, 18 figures, 12 tables.

Figures (18)

  • Figure 1: Comparison between input-output pairs (top) and test harness (bottom).
  • Figure 2: Generalization to unseen models. The buggy code is sampled from Qwen3-14B, which is not seen during training.
  • Figure 3: Overview of our training pipeline.
  • Figure 3: Best-of-8 performance on LiveCodeBench where the code is selected based on the execution results of the generated test cases.
  • Figure 4: True bug rate (TBR) and invalid test rate (ITR) as the number of test cases increases.
  • ...and 13 more figures