RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning

Aleksei Arzhantsev, Otmane Sakhi, Flavian Vasile

TL;DR

This work addresses a scalability bottleneck in improving LLM reasoning with reinforcement learning, which typically relies on ground-truth rewards. It introduces RoiRL, an offline iterative reinforcement-learning framework that builds offline candidate sets and optimizes a weighted log-likelihood using reward transforms $g_m$. The resulting policy $\pi_m$ evolves as $\pi_m(c,y|x) \propto \left(\prod_{j=1}^m g_j(\tilde{r}_k(y,x,\theta_{j-1}))\right) \pi_0(c,y|x)$. With an appropriate choice of $g_m$, RoiRL can recover the same KL-regularized solution as TTRL, while offering greater stability and fast, memory-efficient training. Empirically, RoiRL outperforms TTRL on math-reasoning benchmarks (MATH500, AMC, AIME), training up to 2.5x faster and generalizing better without relying on true labels. These results point to a practical, label-free path toward self-improving LLMs that scales to larger models and budgets by leveraging offline, self-generated feedback.
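The multiplicative form of the update can be illustrated on a toy problem. The sketch below is not the authors' implementation: it replaces the LLM with a small categorical distribution over final answers, ignores the reasoning chain $c$, uses the exact mode of the current policy as a stand-in for the sampled pseudo-reward $\tilde{r}_k$, and assumes an exponential transform $g(r) = \exp(r/\beta)$. The names `reweight`, `mode_reward`, and `run_rounds` are hypothetical. For a tabular policy, maximizing the weighted log-likelihood $\sum_y \pi_{m-1}(y)\, g(r(y)) \log \pi(y)$ has the closed-form solution $\pi(y) \propto \pi_{m-1}(y)\, g(r(y))$, which is what `reweight` computes.

```python
import math

def reweight(policy, rewards, g):
    """Exact weighted-MLE step for a tabular policy: pi(y) ∝ pi_prev(y) * g(r(y))."""
    weights = {y: p * g(rewards[y]) for y, p in policy.items()}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

def mode_reward(policy):
    """Stand-in pseudo-reward (assumption): 1 for the most likely answer, else 0."""
    top = max(policy, key=policy.get)
    return {y: 1.0 if y == top else 0.0 for y in policy}

def run_rounds(pi0, g, rounds):
    """Apply the reweighting step repeatedly and track the product of transforms."""
    policy = dict(pi0)
    cumulative = {y: 1.0 for y in pi0}  # prod_j g_j(r_j(y)) accumulated so far
    for _ in range(rounds):
        rewards = mode_reward(policy)
        for y in pi0:
            cumulative[y] *= g(rewards[y])
        policy = reweight(policy, rewards, g)
    return policy, cumulative

beta = 1.0  # hypothetical temperature
g = lambda r: math.exp(r / beta)
pi0 = {"42": 0.5, "41": 0.3, "7": 0.2}
pi3, cum = run_rounds(pi0, g, rounds=3)

# Closed form from the TL;DR: pi_m ∝ (prod_j g_j) * pi_0
unnorm = {y: pi0[y] * cum[y] for y in pi0}
z = sum(unnorm.values())
closed_form = {y: v / z for y, v in unnorm.items()}
print(pi3)
print(closed_form)
```

In this toy setting the iterated policy and the closed-form product agree, and probability mass concentrates on the majority answer from round to round.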

Abstract

Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but relies on heavy online RL and incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies. Unlike TTRL, RoiRL eliminates the need to maintain a reference model and instead optimizes weighted log-likelihood objectives, enabling stable training with significantly lower memory and compute requirements. Experimental results show that RoiRL trains up to 2.5x faster and consistently outperforms TTRL on reasoning benchmarks, establishing a scalable path to self-improving LLMs without labels.
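A majority-vote reward of the kind the abstract mentions can be sketched in a few lines. This is an illustrative assumption about the general idea, not the exact reward used by TTRL or RoiRL: several final answers are sampled for one prompt, the most frequent answer is treated as a pseudo-label, and each sample receives reward 1 if it matches that pseudo-label and 0 otherwise. The function name `majority_vote_rewards` is hypothetical.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for a list of sampled final answers.

    The pseudo-label is the most frequent answer; ties resolve to the
    answer that appeared first, following Counter.most_common ordering.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# Five hypothetical samples for one math prompt; no ground-truth label is used.
samples = ["42", "41", "42", "7", "42"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

Because the reward comes only from agreement among the model's own samples, it can be computed on unlabeled prompts.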

Paper Structure

This paper contains 15 sections, 3 theorems, 34 equations, 4 figures, 3 tables, 3 algorithms.

Key Result

Proposition 3.0

For any $\beta > 0$, there exists a choice of the reward transforms $(g_m)_{m\in\mathbb{N}}$ such that the KL-regularized objective targeted by TTRL and the RoiRL algorithm admit the same solution.
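
As a sanity check on why such a choice is plausible, consider the textbook KL-regularized problem for a fixed reward $r$. This is a sketch under stated assumptions: the paper's exact objective is not reproduced in this summary, so the standard form below may differ from it in detail, and the paper's pseudo-reward $\tilde{r}_k$ depends on the previous iterate $\theta_{j-1}$, which this one-step argument ignores.

$$\max_{\pi}\; \mathbb{E}_{(c,y)\sim\pi(\cdot|x)}\big[r(y,x)\big] - \beta\,\mathrm{KL}\big(\pi(\cdot|x)\,\|\,\pi_0(\cdot|x)\big) \quad\Longrightarrow\quad \pi^\star(c,y|x) \propto \exp\!\big(r(y,x)/\beta\big)\,\pi_0(c,y|x)$$

Comparing with the product form $\pi_m(c,y|x) \propto \left(\prod_{j=1}^m g_j\right)\pi_0(c,y|x)$, the single transform $g_1(r) = \exp(r/\beta)$ with $m = 1$ yields the same expression in this simplified, fixed-reward setting.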

Figures (4)

  • Figure 1: Training curves for Qwen-2.5-Math
  • Figure 2: Training curves for Phi-4
  • Figure 3: Training curves for Llama-3.2
  • Figure 4: Entropies for Qwen2.5 on MATH500

Theorems & Definitions (6)

  • Proposition 3.0
  • Lemma A.1
  • proof
  • proof
  • Proposition A.1
  • proof