ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models

Andre He; Nathaniel Weir; Kaj Bostrom; Allen Nie; Darion Cassel; Sam Bayless; Huzefa Rangwala

TL;DR

The paper tackles limited diversity in synthetic reasoning data for training reasoning LLMs via reinforcement learning with verifiable rewards. It introduces ReSyn, a pipeline that automatically generates diverse reasoning environments by combining instance generators and code-based verifiers, formalized as $\mathcal{T}=(\mathcal{S},\mathcal{A},R,O,\rho_0)$ with $\mathcal{A}=\Sigma^*$ and $R$ evaluated by verifiers. Training a Qwen2.5-7B-Instruct model on ReSyn data with RLVR yields improvements on BBH (+14%) and BBEH (+27%), and gains extend to GSM8K and AIME, with ablations showing the generator–verifier gap and task diversity as key factors. The work demonstrates that scalable, verifier-guided synthetic data can meaningfully enhance reasoning capabilities in LLMs, offering a practical path to more robust reasoning without manually curated task sets.
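
To make the tuple $\mathcal{T}=(\mathcal{S},\mathcal{A},R,O,\rho_0)$ concrete, here is a minimal sketch of what one environment could look like in Python. It is an illustration, not code from the paper: the subset-sum task, the class `SubsetSumInstance`, and the functions `sample_instance`, `observe`, and `reward` are all hypothetical names chosen for this example. The sketch also hints at the generator-verifier gap: checking a proposed answer is a few lines of code, whereas finding one requires search.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsetSumInstance:
    """Hidden state s in S for a toy subset-sum task (hypothetical example)."""
    numbers: tuple
    target: int


def sample_instance(rng: random.Random, n: int = 6) -> SubsetSumInstance:
    """rho_0: sample an instance that is solvable by construction."""
    numbers = tuple(rng.randint(1, 30) for _ in range(n))
    chosen = rng.sample(range(n), rng.randint(1, n))
    return SubsetSumInstance(numbers, sum(numbers[i] for i in chosen))


def observe(inst: SubsetSumInstance) -> str:
    """O: render the hidden state as a natural-language question."""
    return (
        f"Numbers: {list(inst.numbers)}. Reply with comma-separated 0-based "
        f"indices of distinct positions whose values sum to {inst.target}."
    )


def reward(inst: SubsetSumInstance, action: str) -> float:
    """R: verify a free-form string action (an element of Sigma^*)."""
    try:
        idx = [int(tok) for tok in action.split(",")]
    except ValueError:
        return 0.0  # unparseable answers earn no reward
    if not idx or len(set(idx)) != len(idx):
        return 0.0  # indices must be non-empty and distinct
    if any(i < 0 or i >= len(inst.numbers) for i in idx):
        return 0.0  # indices must be in range
    return 1.0 if sum(inst.numbers[i] for i in idx) == inst.target else 0.0
```

In an RLVR loop, `observe` would supply the prompt and `reward` would score each sampled completion. No reference solution is stored anywhere in the instance.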

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising approach for training reasoning language models (RLMs) by leveraging supervision from verifiers. Although verifier implementation is easier than solution annotation for many tasks, existing synthetic data generation methods remain largely solution-centric, while verifier-based methods rely on a few hand-crafted procedural environments. In this work, we scale RLVR by introducing ReSyn, a pipeline that generates diverse reasoning environments equipped with instance generators and verifiers, covering tasks such as constraint satisfaction, algorithmic puzzles, and spatial reasoning. A Qwen2.5-7B-Instruct model trained with RL on ReSyn data achieves consistent gains across reasoning benchmarks and out-of-domain math benchmarks, including a 27% relative improvement on the challenging BBEH benchmark. Ablations show that verifier-based supervision and increased task diversity both contribute significantly, providing empirical evidence that generating reasoning environments at scale can enhance reasoning abilities in RLMs.

Paper Structure (25 sections, 4 equations, 7 figures, 5 tables)

This paper contains 25 sections, 4 equations, 7 figures, 5 tables.

Introduction
ReSyn Data Pipeline
Problem Formulation
Data Pipeline
Experiments
Reinforcement Learning with Verifiable Rewards
Training Details
Main Results
Big-Bench Hard (BBH)
BigBench Extra Hard (BBEH)
Ablation Studies
Ablation: The Generator-Verifier Gap
Ablation: Scaling Along Tasks vs. Instances
Related Works
Conclusion
...and 10 more sections

Figures (7)

Figure 1: Overview of synthetic environment generation in the ReSyn data pipeline. An LLM is prompted with seed keywords to synthesize Python implementations of reasoning environments, each defining instance generation $\rho_0$, observation $O$, and reward $R$. Each generated environment is evaluated by an LLM judge. Environments that pass are added to the ReSyn dataset, while those that fail are revised with feedback and re-evaluated.
Figure 2: Example synthetic environment generated by the ReSyn pipeline.
Figure 3: Overview of the ReSyn training pipeline. Left (Instance Generation): Each environment generates instances from $\rho_0$, which are transformed by the observation function $O$ and reward function $R$ into questions and verifiers $(Q, V)$. Right (Reinforcement Learning): A policy model generates candidate solutions for $Q$, which are evaluated by $V$ to provide rewards for model updates.
Figure 3: BigBench Extra Hard overall accuracies (%, micro-average).
Figure 4: Significance test of improvement on BBEH subtasks. Each cell shows the number of tasks where the row model significantly outperforms the column model, with two example tasks. We use paired bootstrap tests ($\alpha=0.01$) for each pair of models.
...and 2 more figures
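
The generate, judge, and revise loop described in the Figure 1 caption can be sketched as follows. This is a hedged illustration only: the function names, the callable-based interface, and the `max_revisions` cap are assumptions made for this example, and the LLM calls are left as injected callables rather than tied to any real API. The source does not say how many revision rounds the pipeline allows.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces standing in for LLM calls.
Generate = Callable[[str], str]              # seed keyword -> environment code
Judge = Callable[[str], Tuple[bool, str]]    # code -> (passed, feedback)
Revise = Callable[[str, str], str]           # (code, feedback) -> revised code


def synthesize_environment(
    keyword: str,
    generate: Generate,
    judge: Judge,
    revise: Revise,
    max_revisions: int = 2,  # assumed cap, not taken from the paper
) -> Optional[str]:
    """Return environment code that passes the judge, or None if it never does."""
    code = generate(keyword)
    for attempt in range(max_revisions + 1):
        passed, feedback = judge(code)
        if passed:
            return code
        if attempt < max_revisions:
            # Failed environments are revised with the judge's feedback.
            code = revise(code, feedback)
    return None


def build_dataset(
    keywords: List[str], generate: Generate, judge: Judge, revise: Revise
) -> List[str]:
    """Collect every environment that eventually passes the judge."""
    dataset = []
    for keyword in keywords:
        env = synthesize_environment(keyword, generate, judge, revise)
        if env is not None:
            dataset.append(env)
    return dataset
```

Passing the three steps as callables keeps the control flow testable with simple stubs before any model is wired in.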
