MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning

Sicheng Zhu, Jiajun Wang, Jiawei Ai, Xin Li

TL;DR

This work introduces a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while suppressing functionally equivalent assertions, and shows that MIST-RL outperforms state-of-the-art baselines.

Abstract

Large Language Models (LLMs) often fail to generate correct code on the first attempt, so generated unit tests are used as verifiers to validate candidate solutions. Despite their success, recent verification methods remain constrained by a "scaling-by-quantity" paradigm. This brute-force approach suffers from a critical limitation: it yields diminishing returns in fault detection while causing severe test redundancy. To address this, we propose MIST-RL (Mutation-based Incremental Suite Testing via Reinforcement Learning), a framework that shifts the focus to "scaling-by-utility". We formulate test generation as a sequential decision process optimized via Group Relative Policy Optimization (GRPO). Specifically, we introduce a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while suppressing functionally equivalent assertions. Experiments on HumanEval+ and MBPP+ demonstrate that MIST-RL outperforms state-of-the-art baselines: it achieves a 28.5% higher mutation score while reducing the number of test cases by 19.3%. Furthermore, we show that these compact, high-utility tests serve as superior verifiers, improving downstream code reranking accuracy on HumanEval+ by 3.05% over the SOTA baseline with 10 candidate samples. The source code and data are provided in the supplementary material.
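
To make the mutation-score metric concrete, the following is a minimal sketch of how such a score is commonly computed: the fraction of mutants that at least one test in the suite detects. Everything in it is an illustrative assumption and not taken from the paper: the toy `relu` reference function, the four hand-written mutants, the `(input, expected)` test format, and the helper names `kills` and `mutation_score`.

```python
from typing import Callable, List, Tuple

Test = Tuple[int, int]          # (input, expected output); hypothetical format
Program = Callable[[int], int]


def kills(test: Test, mutant: Program) -> bool:
    """A test kills a mutant if the mutant's output differs from the expectation."""
    arg, expected = test
    try:
        return mutant(arg) != expected
    except Exception:
        # A crash also counts as a detected fault in this sketch.
        return True


def mutation_score(tests: List[Test], mutants: List[Program]) -> float:
    """Fraction of mutants killed by at least one test in the suite."""
    if not mutants:
        return 0.0
    killed = sum(any(kills(t, m) for t in tests) for m in mutants)
    return killed / len(mutants)


def relu(x: int) -> int:
    """Toy reference implementation (illustrative only)."""
    return x if x > 0 else 0


# Hand-written mutants of `relu`, each with one small injected fault.
mutants: List[Program] = [
    lambda x: x if x < 0 else 0,    # flipped comparison
    lambda x: x if x > 0 else 1,    # changed constant
    lambda x: x if x > 1 else 0,    # shifted boundary
    lambda x: -x if x > 0 else 0,   # negated return value
]

# Three similar tests versus two tests that target distinct behaviours.
large_suite: List[Test] = [(3, 3), (5, 5), (7, 7)]
compact_suite: List[Test] = [(1, 1), (-2, 0)]

large_score = mutation_score(large_suite, mutants)
compact_score = mutation_score(compact_suite, mutants)
```

In this toy setting the three near-identical positive-input tests kill only half of the mutants, while the two-test suite that probes the boundary and the negative branch kills all four, which is the kind of gap the "scaling-by-utility" framing refers to.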

Paper Structure (42 sections, 9 equations, 6 figures, 5 tables, 1 algorithm)

Figures (6)

  • Figure 1: Motivation: Quality Over Quantity. (a) Existing "scaling-by-quantity" methods (Blue/Green lines), including the SOTA CodeRM-8B and the larger Qwen3-14B, exhibit rapid logarithmic saturation in Fault Detection Capability, indicating severe Semantic Redundancy (shaded area). In contrast, MIST-RL (Orange) maintains a steep growth trajectory, creating a significant Utility Gap. (b) This utility translates directly to downstream effectiveness. MIST-RL solves more problems (% Problems Solved) with fewer test suites compared to baselines, validating that optimizing for marginal utility is more efficient than brute-force scaling.
  • Figure 2: Illustration of the Incremental Reward Mechanism. The agent receives positive rewards only for killing new mutants (grey circles). Redundant tests (e.g., T3) that fail to reduce the surviving mutant pool trigger a dynamic penalty ($-\rho$).
  • Figure 3: Marginal Utility Analysis. The figures visualize fault-detection efficiency across datasets. MIST-RL (Red) demonstrates a significantly steeper utility curve compared to CodeRM-8B (Blue), confirming that our approach efficiently prioritizes high-utility test cases early in the generation process.
  • Figure 4: Ablation Study on HumanEval+. (a) Effectiveness: The Incremental Reward is essential for high mutation scores. (b) Efficiency: The Dynamic Penalty significantly reduces test suite length (preventing bloat) while maintaining performance.
  • Figure 5: Case Study on move_one_ball. CodeRM generates complex but redundant tests that miss the "shift-by-one" boundary. MIST-RL generates the minimal counter-example [2, 1], successfully killing the loop-range mutant.
  • ...and 1 more figure
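
The incremental reward mechanism described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's actual reward: each test is represented by the set of mutant ids it kills, the positive reward is assumed to be the fraction of newly killed mutants, and the penalty $\rho$ is held at a fixed placeholder value even though the paper describes it as dynamic. The function name `incremental_rewards` is hypothetical.

```python
from typing import List, Set


def incremental_rewards(
    kill_sets: List[Set[int]],
    num_mutants: int,
    rho: float = 0.1,
) -> List[float]:
    """Assign a per-test reward based on newly killed mutants.

    kill_sets[i] holds the ids of the mutants that test i kills.
    A test that kills at least one surviving mutant earns a positive
    reward (assumed here: newly killed / total mutants). A test that
    leaves the surviving pool unchanged is redundant and receives -rho.
    """
    surviving = set(range(num_mutants))
    rewards: List[float] = []
    for killed_by_test in kill_sets:
        newly_killed = killed_by_test & surviving
        if newly_killed:
            rewards.append(len(newly_killed) / num_mutants)
            surviving -= newly_killed
        else:
            # Redundant test: nothing new was killed.
            rewards.append(-rho)
    return rewards


# Four tests generated in sequence against four mutants (ids 0..3).
# The third test only re-kills mutants 0 and 2, so it is redundant.
kill_sets = [{0, 1}, {2}, {0, 2}, {3}]
rewards = incremental_rewards(kill_sets, num_mutants=4, rho=0.1)
```

Because the reward depends on the pool of surviving mutants rather than on each test in isolation, the same test can earn a positive reward early in a suite and a penalty later, which is what discourages functionally equivalent assertions.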