A Deep Dive into Scaling RL for Code Generation with Synthetic Data and Curricula

Cansu Sancaktar, David Zhang, Gabriel Synnaeve, Taco Cohen

Abstract

Reinforcement learning (RL) has emerged as a powerful paradigm for improving large language models beyond supervised fine-tuning, yet sustaining performance gains at scale remains an open challenge, as data diversity and structure, rather than volume alone, become the limiting factor. We address this by introducing a scalable multi-turn synthetic data generation pipeline in which a teacher model iteratively refines problems based on in-context student performance summaries, producing structured difficulty progressions without any teacher fine-tuning. Compared to single-turn generation, this multi-turn approach substantially improves the yield of valid synthetic problems and naturally produces stepping stones, i.e., easier and harder variants of the same core task, that support curriculum-based training. We systematically study how task difficulty, curriculum scheduling, and environment diversity interact during RL training across the Llama3.1-8B Instruct and Qwen3-8B Base model families, with additional scaling experiments on Qwen2.5-32B. Our results show that synthetic augmentation consistently improves in-domain code performance and, in most cases, out-of-domain math performance, and we provide empirical insights into how curriculum design and data diversity jointly shape RL training dynamics.

Paper Structure (33 sections, 23 figures, 4 tables)


Figures (23)

  • Figure 1: Overview of the multi-turn synthetic data pipeline. A seed snippet, sampled from random code or real coding puzzles, serves as inspiration for the teacher. In the first turn, the teacher generates an initial problem according to the current environment’s rules, and the student attempts to solve it multiple times. In later turns, the teacher receives a summary of the student’s performance (pass rate and representative solutions) and adapts the problem accordingly. Invalid or redundant generations are filtered and deduplicated before inclusion in the dataset.
  • Figure 2: Example of multi-turn data generation. The top-left panel shows the seed snippet provided to the teacher, taken from a real coding puzzle. In turn 1, the teacher generates a puzzle with a student pass rate of 0.875 ($M=8$). In turn 2, after observing the student’s performance, the teacher produces a harder variant with a pass rate of 0.25.
  • Figure 3: Scaling with real data in Qwen3-8B Base. We compare RL training on 25K and 81K real coding problems using GRPO (3 seeds). Performance is tracked on in-domain (LCB) and out-of-domain (Math500 and AIME2024) benchmarks throughout training. Performance gains plateau early, indicating limited benefit from scaling real data alone.
  • Figure 4: Synthetic data augmentations in Llama3.1-8B Instruct. RL training on 25K real code-contest problems (baseline) versus 25K real plus 20K synthetic problem augmentations seeded from solved real questions (3 seeds). Synthetic augmentation improves performance across both in-domain (code: LCB) and out-of-domain (math: Math500, AIME2024) benchmarks.
  • Figure 5: Synthetic data augmentations in Qwen3-8B Base. RL training on real code-contest problems (baseline) versus with synthetic problem augmentation (3 seeds). Synthetic data are seeded either with answers to real questions (SYNTH-Real-Aug) or with random code snippets from starcoderdata (SYNTH*-Aug). Performance improves primarily on in-domain (code: LCB), while out-of-domain (math: Math500, AIME2024) benchmark performance remains comparable or slightly lower.
  • ...and 18 more figures
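
The loop described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `teacher`, `student`, and `verify` callables, the `StudentSummary` and `TurnRecord` containers, and the whitespace-based deduplication key are all hypothetical placeholders. The sketch also assumes that $M$ in Figure 2 denotes the number of student attempts per problem, which the reported pass rates (0.875 and 0.25 with $M=8$) suggest but the captions do not state outright.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StudentSummary:
    """What the teacher sees in later turns (hypothetical container)."""
    pass_rate: float
    representative_solutions: List[str]


@dataclass
class TurnRecord:
    """One generated problem and the student's pass rate on it."""
    problem: str
    pass_rate: float


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of attempts that passed; 0.0 when there are no attempts."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def run_turns(
    seed: str,
    teacher: Callable[[str, Optional[str], Optional[StudentSummary]], Optional[str]],
    student: Callable[[str, int], str],
    verify: Callable[[str, str], bool],
    num_turns: int = 2,
    num_attempts: int = 8,  # assumed to correspond to M in Figure 2
) -> List[TurnRecord]:
    """Turn 1: teacher writes a problem from the seed alone.
    Later turns: teacher adapts the previous problem using the summary."""
    history: List[TurnRecord] = []
    problem: Optional[str] = None
    summary: Optional[StudentSummary] = None
    for _ in range(num_turns):
        problem = teacher(seed, problem, summary)
        if problem is None:  # treat None as an invalid generation and stop
            break
        solutions = [student(problem, i) for i in range(num_attempts)]
        outcomes = [verify(problem, s) for s in solutions]
        summary = StudentSummary(pass_rate(outcomes), solutions[:2])
        history.append(TurnRecord(problem, summary.pass_rate))
    return history


def deduplicate(records: List[TurnRecord]) -> List[TurnRecord]:
    """Drop records whose problem text repeats after whitespace and case
    normalization (a simple stand-in for the filtering step in Figure 1)."""
    seen = set()
    unique: List[TurnRecord] = []
    for record in records:
        key = " ".join(record.problem.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Under the stated assumption, the pass rates in Figure 2 would correspond to 7 of 8 successful attempts in turn 1 and 2 of 8 in turn 2. Keeping every turn's record, rather than only the final one, is what would yield the easier and harder variants of the same task that the abstract calls stepping stones.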