Learning to Generate Unit Test via Adversarial Reinforcement Learning

Dongjun Lee, Changho Hwang, Kimin Lee

TL;DR

This paper tackles the challenge of automatically generating high-quality unit tests for code produced by humans or LLMs. It introduces UTRL, an adversarial reinforcement learning framework that co-trains a unit test generator and a code generator: the unit test generator is rewarded for producing tests that discriminate near-correct code from ground-truth solutions, while the code generator is rewarded for producing code that passes these tests. The approach eliminates the need for ground-truth unit-test annotations and demonstrates superior unit-test quality and competitive code-generation performance on the TACO dataset, outperforming supervised baselines and frontier models like GPT-4.1. By leveraging iterative co-evolution and carefully designed reward signals, UTRL offers a scalable path to robust unit-test generation and has the potential to extend to broader software engineering tasks.

Abstract

Unit testing is a core practice in programming, enabling systematic evaluation of programs produced by human developers or large language models (LLMs). Given the challenges in writing comprehensive unit tests, LLMs have been employed to automate test generation, yet methods for training LLMs to produce high-quality tests remain underexplored. In this work, we propose UTRL, a novel reinforcement learning framework that trains an LLM to generate high-quality unit tests given a programming instruction. Our key idea is to iteratively train two LLMs, the unit test generator and the code generator, in an adversarial manner via reinforcement learning. The unit test generator is trained to maximize a discrimination reward, which reflects its ability to produce tests that expose faults in the code generator's solutions, and the code generator is trained to maximize a code reward, which reflects its ability to produce solutions that pass the unit tests generated by the test generator. In our experiments, we demonstrate that unit tests generated by Qwen3-4B trained via UTRL show higher quality compared to unit tests generated by the same model trained via supervised fine-tuning on human-written ground-truth unit tests, yielding code evaluations that more closely align with those induced by the ground-truth tests. Moreover, Qwen3-4B trained with UTRL outperforms frontier models such as GPT-4.1 in generating high-quality unit tests, highlighting the effectiveness of UTRL in training LLMs for this task.

Paper Structure

This paper contains 44 sections, 6 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of UTRL. The unit test generator is trained to generate unit tests that detect faults in code produced by the code generator, and the code generator is trained to produce code that passes the generated unit tests.
  • Figure 2: Illustration of the input and desired output of the unit test generator LLM. Given a programming instruction specifying the input arguments and functionality of a program, the unit test generator LLM reasons about the task and generates a set of $N$ test cases that comprehensively cover the various edge cases.
  • Figure 3: Overview of how the discrimination reward is computed for a unit test $\mathcal{T}$. First, among the $5$ test cases in $\mathcal{T}$, those that pass under the ground-truth code $C^*$ (i.e., $T_1, T_3, T_5$) are retained, forming the set of functionally valid test cases. Second, the discrimination reward is defined as the fraction of sampled code solutions that fail at least one valid test case. In this figure, 4 of the 6 sampled code solutions ($C_1, C_3, C_4, C_5$) fail at least one valid test case, giving a discrimination reward of $\frac{4}{6} \approx 0.667$.
  • Figure 4: Fidelity of unit tests generated by Qwen3-4B and Qwen3-14B trained with UTRL, compared against baselines. Both models trained via UTRL achieve higher unit test fidelity than the baselines, demonstrating the effectiveness of UTRL in training LLMs to generate unit tests that closely approximate the code evaluation induced by the GT unit tests.
  • Figure 5: Best-of-$N$ improvement (left) and unit test fidelity (right) achieved by UTRL, compared against SFT baselines (SFT w/ $\mathcal{D}_\text{UT}$, SFT w/ $\mathcal{D}_\text{reason+UT}$).
  • ...and 6 more figures
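
The discrimination reward described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: test cases are modeled as hypothetical `(input, expected_output)` pairs, code solutions as plain Python callables, and the names `passes` and `discrimination_reward` are invented here. Returning 0 when no valid test case survives the filter is also an assumption, since the caption does not cover that case. The toy data is constructed only to reproduce the counts in the caption (3 valid tests out of 5, 4 failing solutions out of 6).

```python
from typing import Any, Callable, List, Tuple

# Hypothetical representation: a test case is an (input, expected output) pair.
TestCase = Tuple[Any, Any]
Code = Callable[[Any], Any]


def passes(code: Code, test: TestCase) -> bool:
    """Return True if `code` produces the expected output for `test`."""
    test_input, expected = test
    try:
        return code(test_input) == expected
    except Exception:
        # A crash counts as a failed test case.
        return False


def discrimination_reward(
    tests: List[TestCase], ground_truth: Code, sampled_codes: List[Code]
) -> float:
    """Fraction of sampled solutions that fail at least one valid test case."""
    # Step 1: keep only test cases that the ground-truth code passes.
    valid = [t for t in tests if passes(ground_truth, t)]
    if not valid or not sampled_codes:
        return 0.0  # assumption: no valid tests (or no samples) earns no reward
    # Step 2: count sampled solutions exposed by at least one valid test case.
    exposed = sum(
        1 for code in sampled_codes if any(not passes(code, t) for t in valid)
    )
    return exposed / len(sampled_codes)


# Toy task: return the absolute value of an integer.
ground_truth = abs
tests = [
    (3, 3),    # T1: valid
    (-2, -2),  # T2: invalid (ground truth returns 2)
    (-4, 4),   # T3: valid
    (0, 1),    # T4: invalid (ground truth returns 0)
    (0, 0),    # T5: valid
]
sampled_codes = [
    lambda x: x,                    # C1: fails T3
    abs,                            # C2: passes all valid tests
    lambda x: -x,                   # C3: fails T1
    lambda x: max(x, 1),            # C4: fails T3 and T5
    lambda x: abs(x) + 1,           # C5: fails T1, T3, T5
    lambda x: x if x >= 0 else -x,  # C6: passes all valid tests
]

reward = discrimination_reward(tests, ground_truth, sampled_codes)
print(f"{reward:.3f}")
```

Filtering by the ground-truth code first means that an incorrect test case cannot earn reward merely by rejecting every solution; only functionally valid test cases contribute to the ratio.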