REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf

TL;DR

Reasoning Gym (RG) tackles data scarcity in reinforcement learning with verifiable rewards (RLVR) by providing a library of procedurally generated reasoning tasks with automatic verifiers across algebra, algorithms, logic, cognition, games, and geometry. Its design emphasizes algorithmic verifiability, large solution spaces, and controllable difficulty to enable dynamic curricula and unlimited data without memorization. Empirical results show that RLVR training on RG tasks improves intra-domain and cross-domain reasoning capabilities and yields transfer to external benchmarks like GSM8K, MATH, Big-Bench Hard, and MMLU-Pro, while zero-shot evaluations expose persistent gaps for non-reasoning baselines and highlight a pronounced difficulty cliff as task difficulty increases. The work demonstrates RG’s potential as both a rigorous evaluation framework and a scalable training ground for robust, generalizable reasoning in language models, and it releases the full library as an open-source resource. Key contributions include: a versatile, verifiable, procedurally generated task suite; comprehensive zero-shot and transfer analyses; curriculum RLVR demonstrations; and evidence of cross-domain and external benchmark generalization.

Abstract

We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG both for evaluating reasoning models and for training them with reinforcement learning.
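The idea of a procedural generator paired with an automatic verifier can be illustrated with a minimal sketch. Everything below is hypothetical and is not RG's actual API: the names `Task`, `generate_task`, and `score_answer`, and the difficulty parameters `num_terms` and `max_value`, are assumptions chosen for illustration. The sketch shows the three properties the abstract describes: a seed gives reproducible but effectively unlimited data, parameters adjust complexity, and the answer is checked algorithmically.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One generated problem together with its ground-truth answer."""
    question: str
    answer: str


def generate_task(seed: int, num_terms: int = 2, max_value: int = 10) -> Task:
    """Hypothetical generator: a chain of additions and subtractions.

    Difficulty is controlled by `num_terms` (length of the chain) and
    `max_value` (size of the operands). The same seed and parameters
    always produce the same task.
    """
    rng = random.Random(seed)
    terms = [rng.randint(1, max_value) for _ in range(num_terms)]
    ops = [rng.choice("+-") for _ in range(num_terms - 1)]

    expr = str(terms[0])
    total = terms[0]
    for op, term in zip(ops, terms[1:]):
        expr += f" {op} {term}"
        total = total + term if op == "+" else total - term
    return Task(question=f"What is {expr}?", answer=str(total))


def score_answer(task: Task, model_output: str) -> float:
    """Hypothetical verifier: exact match against the computed answer."""
    return 1.0 if model_output.strip() == task.answer else 0.0


# An easy and a harder configuration of the same generator.
easy = generate_task(seed=0, num_terms=2, max_value=10)
hard = generate_task(seed=0, num_terms=6, max_value=1000)
```

Because the task is derived from the seed rather than stored, a training loop in this style could draw fresh problems indefinitely and raise `num_terms` or `max_value` as a curriculum progresses.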

Paper Structure

This paper contains 19 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Example RG tasks from three categories.
  • Figure 2: Frontier models struggle with challenging RG configurations. Reasoning models like o3-mini and DeepSeek-R1 tend to outperform non-reasoning models, but tasks configured with challenging parameters are still far from saturated.
  • Figure 3: Model and task difficulty comparison. Left: Zero-shot ability across model types on the hard configs. Right: Impact of dataset difficulty on per-category accuracy. The appendix on zero-shot configurations details the easy and hard parameter configurations for each dataset.
  • Figure 4: Rewards of Intra-Domain Generalization RL. There is a sharp increase in reward at the start of training. This is partly attributable to the model quickly learning auxiliary rewards (i.e. formatting) during training, but it is also reflective of how quickly RLVR improves the model's ability to solve training tasks.
  • Figure 5: Rewards of Cross-Domain Generalization RL. Rewards initially spike due to learning the format reward (worth 0.2, with an accuracy reward worth 1.0). The model is then able to learn in all cases, but the differing trajectories and final reward values illustrate that some task categories are more challenging than others.
  • ...and 1 more figure
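The Figure 5 caption mentions a format reward worth 0.2 alongside an accuracy reward worth 1.0. A minimal sketch of how such a combined reward might be computed follows. The weights come from the caption; the additive combination, the `<answer>...</answer>` tag convention, and the function name `combined_reward` are assumptions made for illustration and are not confirmed by the source.

```python
import re

FORMAT_WEIGHT = 0.2    # from the Figure 5 caption
ACCURACY_WEIGHT = 1.0  # from the Figure 5 caption

# Assumed output convention: the final answer is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def combined_reward(completion: str, expected: str) -> float:
    """Hypothetical reward: weighted sum of a format check and an exact match."""
    match = ANSWER_PATTERN.search(completion)
    format_ok = match is not None
    correct = format_ok and match.group(1).strip() == expected
    # Booleans are promoted to 0/1 in arithmetic.
    return FORMAT_WEIGHT * format_ok + ACCURACY_WEIGHT * correct


well_formed_wrong = combined_reward("Thinking... <answer>41</answer>", "42")
well_formed_right = combined_reward("Thinking... <answer>42</answer>", "42")
unformatted = combined_reward("The answer is 42", "42")
```

Under these assumptions, a model that only learns the output convention already earns 0.2 per sample, which is consistent with the early reward spike the captions for Figures 4 and 5 describe.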