REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf
TL;DR
Reasoning Gym (RG) tackles data scarcity in reinforcement learning with verifiable rewards (RLVR) by providing a library of procedurally generated reasoning tasks with automatic verifiers across algebra, algorithms, logic, cognition, games, and geometry. Its design emphasizes algorithmic verifiability, large solution spaces, and controllable difficulty to enable dynamic curricula and unlimited data without memorization. Empirical results show that RLVR training on RG tasks improves intra-domain and cross-domain reasoning capabilities and yields transfer to external benchmarks like GSM8K, MATH, Big-Bench Hard, and MMLU-Pro, while zero-shot evaluations expose persistent gaps for non-reasoning baselines and highlight a pronounced difficulty cliff as task difficulty increases. The work demonstrates RG’s potential as both a rigorous evaluation framework and a scalable training ground for robust, generalizable reasoning in language models, and it releases the full library as an open-source resource. Key contributions include: a versatile, verifiable, procedurally generated task suite; comprehensive zero-shot and transfer analyses; curriculum RLVR demonstrations; and evidence of cross-domain and external benchmark generalization.
Abstract
We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG both for evaluating reasoning models and for training them with reinforcement learning.
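To make the pairing of a procedural generator with an automatic verifier concrete, here is a minimal sketch of the idea. It is not RG's actual API: the names `Entry`, `generate_chain_sum`, and `score_answer`, and the difficulty knobs `num_terms` and `num_digits`, are hypothetical and chosen only for illustration. The sketch assumes a simple arithmetic task in which the seed fixes the instance and the knobs control its complexity.

```python
import random
from dataclasses import dataclass


@dataclass
class Entry:
    question: str
    answer: str


def generate_chain_sum(seed: int, num_terms: int, num_digits: int) -> Entry:
    """Hypothetical generator: an add/subtract chain sized by two difficulty knobs."""
    rng = random.Random(seed)  # same seed and knobs always yield the same task
    low, high = 10 ** (num_digits - 1), 10 ** num_digits - 1
    terms = [rng.randint(low, high) for _ in range(num_terms)]
    ops = [rng.choice("+-") for _ in range(num_terms - 1)]

    expr = str(terms[0])
    total = terms[0]
    for op, term in zip(ops, terms[1:]):
        expr += f" {op} {term}"
        total = total + term if op == "+" else total - term
    return Entry(question=f"Compute: {expr}", answer=str(total))


def score_answer(response: str, entry: Entry) -> float:
    """Hypothetical verifier: reward 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if response.strip() == entry.answer else 0.0


# Same generator, two difficulty settings (illustrative values).
easy = generate_chain_sum(seed=0, num_terms=2, num_digits=1)
hard = generate_chain_sum(seed=0, num_terms=6, num_digits=4)
print(easy.question)
print(hard.question)
print(score_answer(easy.answer, easy))  # 1.0, since the reference answer matches itself
```

Because every instance is derived from a seed and a handful of parameters, new tasks can be produced on demand and the reward can be checked programmatically, which is the property the abstract attributes to procedural generation. Real verifiers may award partial credit or accept several answer formats; this sketch uses exact matching only.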
