Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang

TL;DR

This paper introduces CURE, a co-evolving reinforcement learning framework that jointly trains a code generator and a unit-test generator without ground-truth code supervision. By deriving a theoretically grounded reward for unit tests and a co-evolution objective, CURE improves coding performance (ReasonFlux-Coder up to +5.3% in one-shot accuracy and BoN +9.0%), enhances unit-test generation, and enables test-time scaling and agentic coding. It also enables the unit tester to function as a reward model for RL on base models, and it achieves efficiency gains, notably for long-CoT setups, reducing unit-test generation length while maintaining or improving accuracy. The results, across five benchmarks and multiple pipelines, demonstrate the practical impact of self-supervised co-evolution for scalable, robust, and cost-efficient AI-assisted coding and testing.

Abstract

We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder's mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE

Paper Structure

This paper contains 39 sections, 2 theorems, 33 equations, 7 figures, 4 tables, 1 algorithm.

Key Result

Theorem 3.1

Consider a ground truth unit test $u_k$, a correct solution $s_{j_1}$, and an incorrect solution $s_{j_2}$. The precision based on a single ground truth test is given by … However, when using the aggregated reward defined in Equation \ref{inferencereward}, we have $P(\mathcal{R}_{s_{j_1}} > \mathcal{R}_{s_{j_2}}) \to 1$ as $m \to \infty$, if and only if $\mu > 0$, where … Moreover, under this condition, t…

Figures (7)

  • Figure 1: Performance of ReasonFlux-Coder-7B, trained with CURE on only 4.5K coding problems, surpasses models that are specifically fine-tuned on large-scale coding data. We generate 16 candidate solution codes and 16 unit tests, selecting the final solution as the one that passes the most generated unit tests, which is a BoN strategy.
  • Figure 2: (a) An example of a problem description along with three generated unit tests derived from the task. The first unit test is incorrect, although it is easily produced due to strong hallucination. The second unit test is correct but naive, allowing some incomplete or poorly thought-out code to pass. The final unit test is both correct and non-naive, though generating such a test is much easier than actually solving the full coding problem. (b–d) Co-evolving process: (b) unit test accuracy, (c) code accuracy, and (d) estimated reward versus number of steps. (e–f) The long-CoT unit tester becomes increasingly efficient in reasoning as the response length decreases during optimization.
  • Figure 3: Method Pipeline Overview. In our RL framework, for each task, we generate a batch of unit tests and code solutions, along with some ground-truth unit tests. Using these, we construct an execution table. From this table, we extract rewards for each unit test (Equation \ref{R_u_k}) and code response (Equation \ref{R_s_j}). For the long-CoT model, we apply a transformation on the reward to ensure efficiency. Then we optimize both the unit tester and the coder iteratively.
  • Figure 4: The BoN performance improvement after optimization on the base model. Four curves (left to right) show sampling 2, 4, 8, and 16 generated codes; each curve’s five points represent 1, 2, 4, 8, and 16 generated unit tests.
  • Figure 5: The BoN performance improvement across benchmarks when using ReasonFlux-Coder-4B as unit tester. Four curves (left to right) show sampling 2, 4, 8, and 16 generated codes; each curve’s five points represent 1, 2, 4, 8, and 16 generated unit tests.
  • ...and 2 more figures
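
The BoN strategy described in the Figure 1 caption (sample several candidate solutions and several generated unit tests, then keep the solution that passes the most tests) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper or its repository: candidates are modeled as plain Python callables, a unit test is an `(arguments, expected_output)` pair, and the helper names `run_test`, `execution_table`, `pass_counts`, and `best_of_n` are hypothetical. The sketch covers only the selection step, not the training rewards of Figure 3.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical test format (an assumption): (positional arguments, expected output).
UnitTest = Tuple[tuple, object]


def run_test(solution: Callable, test: UnitTest) -> bool:
    """Return True if the candidate solution produces the expected output."""
    args, expected = test
    try:
        return solution(*args) == expected
    except Exception:
        # A crashing candidate simply fails the test.
        return False


def execution_table(solutions: Sequence[Callable],
                    tests: Sequence[UnitTest]) -> List[List[bool]]:
    """Pass/fail matrix: row j is a candidate solution, column k is a unit test."""
    return [[run_test(s, t) for t in tests] for s in solutions]


def pass_counts(solutions: Sequence[Callable],
                tests: Sequence[UnitTest]) -> List[int]:
    """Number of generated tests each candidate passes."""
    return [sum(row) for row in execution_table(solutions, tests)]


def best_of_n(solutions: Sequence[Callable], tests: Sequence[UnitTest]) -> int:
    """Index of the candidate passing the most tests (ties go to the lowest index)."""
    counts = pass_counts(solutions, tests)
    return max(range(len(counts)), key=lambda j: counts[j])


# Toy task (an assumption): "return the sum of two integers".
def cand_add(a, b):
    return a + b      # correct


def cand_mul(a, b):
    return a * b      # incorrect


def cand_sub(a, b):
    return a - b      # incorrect


candidates = [cand_add, cand_mul, cand_sub]

# Generated tests; the last one is deliberately wrong, mimicking a hallucinated test.
tests = [((1, 2), 3), ((2, 2), 4), ((0, 5), 5), ((1, 1), 3)]

print(pass_counts(candidates, tests))
print(best_of_n(candidates, tests))
```

In this toy run the correct candidate still wins even though one generated test is wrong, because the decision aggregates over all tests rather than trusting any single one. That aggregation is the intuition behind evaluating BoN with many generated unit tests.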

Theorems & Definitions (4)

  • Theorem 3.1
  • proof
  • Proposition A.1
  • proof