Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models

Zongqian Li, Shaohan Huang, Zewen Chi, Yixuan Su, Lexin Zhou, Li Dong, Nigel Collier, Furu Wei

TL;DR

This work proposes MicroCoder-GRPO, an improved Group Relative Policy Optimization approach with three innovations: conditional truncation masking to improve long output potential while maintaining training stability, diversity-determined temperature selection to maintain and encourage output diversity, and removal of KL loss with high clipping ratios to facilitate solution diversity.

Abstract

Modern code generation models exhibit longer outputs, accelerated capability growth, and changed training dynamics, rendering traditional training methodologies, algorithms, and datasets ineffective for improving their performance. To address these training bottlenecks, we propose MicroCoder-GRPO, an improved Group Relative Policy Optimization approach with three innovations: conditional truncation masking to improve long output potential while maintaining training stability, diversity-determined temperature selection to maintain and encourage output diversity, and removal of KL loss with high clipping ratios to facilitate solution diversity. MicroCoder-GRPO achieves up to 17.6% relative improvement over strong baselines on LiveCodeBench v6, with more pronounced gains under extended context evaluation. Additionally, we release MicroCoder-Dataset, a more challenging training corpus that achieves 3x larger performance gains than mainstream datasets on LiveCodeBench v6 within 300 training steps, and MicroCoder-Evaluator, a robust framework with approximately 25% improved evaluation accuracy and around 40% faster execution. Through comprehensive analysis across more than thirty controlled experiments, we reveal 34 training insights across seven main aspects, demonstrating that properly trained models can achieve competitive performance with larger counterparts.
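
The three ingredients named above can be illustrated with a minimal sketch of a GRPO-style loss. This is not the authors' implementation: the function names, the clip values (0.2 and 0.28), and the token-level averaging are assumptions chosen for illustration. The abstract does not state the condition under which a truncated response is masked, so the sketch takes that decision as a caller-supplied flag rather than guessing it. Temperature selection happens at sampling time and is not shown.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (the 'group relative' step)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_low=0.2, clip_high=0.28):
    """PPO-style clipped term with a wider upper clip bound.

    clip_low and clip_high are placeholder values, not taken from the paper.
    """
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)


def grpo_loss(logp_new, logp_old, rewards, mask_flags,
              clip_low=0.2, clip_high=0.28):
    """Token-averaged clipped policy loss for one group of responses.

    logp_new, logp_old: one list of per-token log-probabilities per response.
    rewards: one scalar reward per response.
    mask_flags: True for responses excluded from the loss, e.g. truncated
        outputs that meet the (unspecified here) masking condition.
    """
    advantages = group_relative_advantages(rewards)
    total, n_tokens = 0.0, 0
    for new, old, adv, masked in zip(logp_new, logp_old, advantages, mask_flags):
        if masked:
            continue  # masked responses contribute no gradient signal
        for lp_new, lp_old in zip(new, old):
            ratio = math.exp(lp_new - lp_old)
            total += clipped_surrogate(ratio, adv, clip_low, clip_high)
            n_tokens += 1
    if n_tokens == 0:
        return 0.0
    # No KL penalty is added, mirroring the "removal of KL loss" described above.
    return -total / n_tokens
```

Passing `mask_flags` in from outside keeps the masking rule separate from the loss itself, so different conditions (no mask, complete masking, or a conditional rule) can be compared without touching the objective.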

Paper Structure

This paper contains 18 sections, 2 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Cross-Model Training Dynamics. Performance and response length across Qwen 2.5 and Qwen 3 models, illustrating generation-specific training behaviors and output characteristics.
  • Figure 2: Truncation Masking Effects on Training. Performance trends under different masking strategies, comparing no mask, complete masking, and conditional masking at various rates.
  • Figure 3: Temperature Dynamics and Scheduling Analysis. Shows temperature robustness trends during training, with increasing stability at higher temperatures; convergence of output diversity across temperature settings; training failure when a low temperature causes initial diversity to fall below the convergence value; and better performance from dynamic temperature scheduling (low-to-high transition) than from static temperatures.
  • Figure 4: Influence of KL Loss and Clip Ratio on Training Dynamics. Performance comparison between standard KL loss and removed KL loss with high clipping, illustrating the relationship between diversity maintenance and sustained performance improvement.
  • Figure 5: Dataset Quality Comparison. Training dynamics comparing MicroCoder and DeepCoder datasets across accuracy, critic reward, and response length metrics, demonstrating learning effectiveness on challenging problems.
  • ...and 4 more figures