DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation

Speed Zhu, Jianwei Cai, Guang Chen, Lulu Wu, Saiyong Yang, Wiggin Zhou

TL;DR

This work addresses the challenge of RLVR for competitive-programming code generation by prioritizing data curation and curriculum design alongside algorithmic advances. It introduces a two-stage RL framework: first, entropy expansion to broaden exploration, followed by a hard-focus curriculum (Pre-GRPO) that concentrates on the most difficult problems with large rollout budgets. Empirical results on Qwen2.5-32B show state-of-the-art performance among similarly sized models and competitive parity with larger systems, with substantial gains on hard benchmarks like Codeforces. The study also provides scaling insights in a large MoE setting and distills practical best practices for data curation, curriculum design, and resource allocation in RLVR for competitive programming, offering a roadmap for future high-performance, data-aware RLVR work.

Abstract

Recent reasoning-first models (e.g., OpenAI o1, DeepSeek R1) have spurred a resurgence of interest in RLVR. Nevertheless, advances are dominated by mathematics (e.g., AIME), with competitive-programming code generation underexplored and data curation receiving less attention than RL algorithm design. We investigate how to construct RLVR datasets (i.e., RL prompts) and present practical training techniques that yield strong performance on competitive-programming code generation. Our pipeline begins with supervised fine-tuning (SFT) distilled from strong open-source models, augmented with general-purpose and reasoning-intensive data. RL then follows a two-stage process with executable, testcase-driven rewards. In the first stage, we train on a large, uniformly distributed set of competitive-programming problems using Group Relative Policy Optimization (GRPO) with 8 rollouts per prompt and a relatively short response-generation window (e.g., 24k in this stage, compared with 32k during SFT) to expand entropy and mitigate repetition and truncation. In the second stage, we perform Pre-GRPO: updating on a small, high-quality set of challenging problems with a large rollout budget (64 rollouts per prompt) under a hard-focus curriculum that continuously retains the most difficult instances throughout training. We implement our method on Qwen2.5-32B and evaluate on LeetCode and Codeforces weekly contests to avoid data leakage. The resulting model achieves state-of-the-art performance among models of similar scale and is comparable to leading systems such as DeepSeek v3.1 and Doubao-1.5-Thinking. We also examine scaling trends and observe strong RL scaling on an internal large-scale MoE model. Our study distills concise best practices for data curation, entropy expansion, and curriculum design in RLVR for competitive-programming code generation.
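
The two ingredients named in the abstract, group-relative rewards over a set of rollouts and a curriculum that keeps the hardest prompts, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the population standard deviation, the `eps` stabilizer, and the "lowest pass rate" retention rule are all assumptions made for illustration. The abstract states only that GRPO is used with 8 or 64 rollouts per prompt and that the most difficult instances are retained.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's rollout group (GRPO-style).

    Each rollout's advantage is its reward minus the group mean, divided by
    the group standard deviation. `eps` (an assumption) avoids division by
    zero when every rollout receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def retain_hardest(pass_rates: Dict[str, float], keep: int) -> List[str]:
    """Hypothetical hard-focus filter: keep the `keep` prompts with the
    lowest observed pass rate (fraction of rollouts passing all test cases).
    Ties are broken by prompt id so the result is deterministic.
    """
    ranked = sorted(pass_rates, key=lambda pid: (pass_rates[pid], pid))
    return ranked[:keep]


if __name__ == "__main__":
    # One prompt, 8 rollouts, binary test-case reward (1.0 = all tests pass).
    rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(group_relative_advantages(rewards))

    # Made-up pass rates for three hypothetical prompt ids.
    print(retain_hardest({"p1": 0.50, "p2": 0.00, "p3": 0.25}, keep=2))
```

In this sketch a group whose rollouts all fail (or all pass) yields zero advantages and therefore no learning signal, which is one plausible reason to spend a larger rollout budget on hard problems, where a correct rollout is rare.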

Paper Structure

This paper contains 16 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Performance of our models on various benchmarks.
  • Figure 2: The training pipeline of our models.
  • Figure 3: The entropy comparison of 24k-style training and 32k-style training
  • Figure 4: The performance of different RL training strategies on LiveCodeV6 and LiveCode08-11 during training
  • Figure 5: The Accuracy Trends by First Appearance Accuracy Clusters
  • ...and 3 more figures