RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, Hannaneh Hajishirzi

TL;DR

RLVE addresses the plateau in language-model reinforcement learning caused by static problem distributions by introducing Adaptive Verifiable Environments, which procedurally generate verifiable problems and adjust their difficulty to match the LM's capabilities. The accompanying suite, RLVE-Gym, comprises 400 manually engineered environments; joint training across all of them yields a 3.37% absolute average improvement across six reasoning benchmarks, compared with 0.49% from continuing the model's original RL training with over 3x more compute. The results indicate that scaling RL training along the environment dimension, rather than merely increasing data volume or compute, improves both in-distribution learning and out-of-distribution generalization. By releasing the code, the work promotes broader adoption of adaptive environment engineering as a scalable paradigm for LM RL.

Abstract

We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.

Paper Structure

This paper contains 34 sections, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: (a) During RL training, some array-sorting problems that were appropriately challenging become too easy, while others that were too hard become learnable as the policy improves. This is visible in the upward movement of the dark region, which contains the problems for which some rollouts are correct and others are not. (b) RLVE trains an LM on verifiable environments that dynamically adjust problem difficulty based on its performance over time. (c) Starting from ProRL-1.5B-v2 [hu2025prorlv2], continuing training with RLVE yields a 3.37% absolute average improvement across six reasoning benchmarks, whereas continuing the original RLVR training achieves a 0.49% average absolute gain using more than 3$\times$ the compute.
  • Figure 2: Illustration of adaptive difficulty enabled by RLVE when training a policy model $\pi$ on the Sorting environment. Shown are the adaptive difficulty level $h_\pi$ and the model $\pi$’s accuracy on problems generated from this level at each step. Whenever the accuracy exceeds the threshold $\tau_{\mathrm{acc}}$ (90%), RLVE increments $h_\pi$ by 1, shifting the difficulty distribution to harder problems.
  • Figure 3: Comparison of RLVE (using dynamically adjusted difficulty range) against three types of static difficulty ranges. (a) reports the effective prompt ratio, defined as the percentage of prompts retained after dynamic sampling whose rollouts yield non-identical rewards; a higher ratio indicates fewer wasted rollouts and thus generally better learning efficiency. (b) shows in-distribution (ID) accuracies on the same training environment, and (c) shows out-of-distribution (OOD) accuracies on the 50 held-out verifiable environments. Adaptive difficulty maintains the highest effective prompt ratio and achieves superior ID and OOD performance, whereas static difficulty suffers from either early saturation or inefficient learning.
  • Figure 4: (a) shows the frequency distribution of the upper-bound difficulty levels $h_\pi^{(i)}$ reached by adaptive environments at step 400. (b) compares training jointly on 256 environments with adaptive versus static difficulty distributions. Despite covering all adaptive environments’ distributions, training on the static environments consistently underperforms.
  • Figure 5: Comparison of RLVE with joint training on collections of four different sizes of verifiable environments, all under identical training setups. Each larger collection strictly contains all smaller ones. Shown are the accuracies on 50 held-out verifiable environments throughout training. Expanding the collection of training environments consistently leads to better performance on held-out environments (unseen during training) across all model types.
  • ...and 4 more figures
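
The difficulty update described in the Figure 2 caption and the effective prompt ratio defined in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration written from the captions alone, not the authors' released code: the names `AdaptiveDifficulty`, `update`, and `effective_prompt_ratio` are hypothetical, rewards are assumed to be binary (1 for a correct rollout, 0 otherwise), and the accuracy is assumed to be a simple mean over rollouts on problems generated from the current level.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AdaptiveDifficulty:
    """Tracks the adaptive difficulty level h_pi of one environment (hypothetical helper)."""

    level: int = 0        # h_pi, the current difficulty level
    tau_acc: float = 0.9  # accuracy threshold tau_acc (90% in the Figure 2 caption)

    def update(self, rewards: Sequence[int]) -> int:
        """Increment the level by 1 when accuracy exceeds tau_acc.

        `rewards` holds 0/1 correctness for rollouts on problems
        generated from the current level (an assumption of this sketch).
        """
        if rewards:
            accuracy = sum(rewards) / len(rewards)
            if accuracy > self.tau_acc:
                self.level += 1
        return self.level


def effective_prompt_ratio(rollout_rewards: List[Sequence[float]]) -> float:
    """Percentage of prompts whose rollouts yield non-identical rewards."""
    if not rollout_rewards:
        return 0.0
    effective = sum(1 for rewards in rollout_rewards if len(set(rewards)) > 1)
    return 100.0 * effective / len(rollout_rewards)


if __name__ == "__main__":
    difficulty = AdaptiveDifficulty()
    difficulty.update([1] * 9 + [0])  # accuracy 0.9 does not exceed 0.9
    difficulty.update([1] * 10)       # accuracy 1.0 exceeds 0.9
    print(difficulty.level)
    print(effective_prompt_ratio([[1, 1], [0, 1], [0, 0], [1, 0]]))
```

Prompts whose rollouts all receive the same reward are counted as ineffective here because the caption describes them as wasted rollouts. Raising the level only when accuracy exceeds the threshold is what keeps generated problems away from both the too-easy and the too-hard regimes.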