Synthetic Data RL: Task Definition Is All You Need

Yiduo Guo, Zhen Guo, Chuanwei Huang, Zi-Ang Wang, Zekai Zhang, Haofei Yu, Huishuai Zhang, Yikang Shen

TL;DR

Synthetic Data RL presents a scalable framework to adapt foundation models to specialized domains using only task-definition-derived synthetic data. By integrating knowledge-guided generation, a difficulty-aware curriculum, and high-potential sample selection within a GRPO training loop, it achieves strong results across math, reasoning, and domain-specific datasets, often matching or exceeding RL with human data under the same budget. The approach reduces human supervision to a task description and demonstrates robustness across base/instructor models, while revealing key dependencies on base model capabilities and data design. This method offers a practical path to cost-efficient, large-scale domain adaptation of foundation models with RL techniques.

Abstract

Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question-answer pairs from the task definition and retrieved documents, then adapts the difficulty of each question based on model solvability, and finally selects questions for RL training using the model's average pass rate across samples. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves GSM8K performance by only 0.4 pp, showing limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.
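
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `average_pass_rate` and `select_for_rl`, the `budget` parameter, and the specific rule used here (drop questions the model never solves, then rank the rest by ascending pass rate) are assumptions made for illustration only. The abstract states only that questions are selected using the model's average pass rate across samples.

```python
from typing import Dict, List, Sequence


def average_pass_rate(attempts: Sequence[bool]) -> float:
    """Fraction of sampled attempts that were judged correct."""
    if not attempts:
        raise ValueError("need at least one attempt per question")
    return sum(attempts) / len(attempts)


def select_for_rl(results: Dict[str, Sequence[bool]], budget: int) -> List[str]:
    """Pick up to `budget` question ids for RL training.

    Assumed rule (hypothetical, not taken from the paper text above):
    discard questions with pass rate 0, then prefer the lowest pass rates.
    """
    rates = {qid: average_pass_rate(a) for qid, a in results.items()}
    solvable = [qid for qid, rate in rates.items() if rate > 0.0]
    # Sort by pass rate, breaking ties by id so the order is deterministic.
    solvable.sort(key=lambda qid: (rates[qid], qid))
    return solvable[:budget]


if __name__ == "__main__":
    # Toy data: each list holds correctness flags for sampled model answers.
    demo = {
        "q_easy": [True, True, True, True],
        "q_mid": [True, False, True, False],
        "q_hard": [False, False, True, False],
        "q_unsolved": [False, False, False, False],
    }
    print(select_for_rl(demo, budget=2))  # ['q_hard', 'q_mid']
```

In this toy run the never-solved question is excluded and the two lowest non-zero pass rates are kept. A real pipeline would obtain the correctness flags by sampling the base model several times per synthetic question and checking each answer against the generated reference.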

Paper Structure

This paper contains 36 sections, 10 equations, 26 figures, 4 tables, 1 algorithm.

Figures (26)

  • Figure 1: High-level overview for Synthetic Data RL.
  • Figure 2: Comparison of PPO and GRPO: Green shows GRPO with human data, red shows GRPO with synthetic data, and blue shows PPO with synthetic data. The Y-axis indicates accuracy.
  • Figure 3: Pass rate histograms for GSM8K, LogiQA and MedQA.
  • Figure 4: One example for three task instructions $\mathcal{I}_{\textit{des}}$, $\mathcal{I}_{\textit{input}}$, $\mathcal{I}_{\textit{output}}$
  • Figure 5: The keyword extraction prompt
  • ...and 21 more figures