RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning
Aleksei Arzhantsev, Otmane Sakhi, Flavian Vasile
TL;DR
Reinforcement learning for LLM reasoning typically relies on ground-truth rewards, which limits how far it can scale. This work introduces RoiRL, an offline iterative reinforcement-learning framework that builds candidate sets offline and optimizes a weighted log-likelihood with reward transforms $g_m$. The resulting policy evolves as $\pi_m(c,y|x) \propto \left(\prod_{j=1}^m g_j(\tilde{r}_k(y,x,\theta_{j-1}))\right) \pi_0(c,y|x)$. With an appropriate choice of $g_m$, RoiRL can recover the same KL-regularized solution as TTRL while training more stably, faster, and with less memory. Empirically, RoiRL outperforms TTRL on math-reasoning benchmarks (MATH500, AMC, AIME), trains up to 2.5x faster, and generalizes better, all without true labels. These results point to a practical, label-free path toward self-improving LLMs that scales to larger models and budgets by using offline, self-generated feedback.
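To make the product form of the policy update concrete, here is a minimal toy sketch over a discrete answer space. It rests on several assumptions that are not taken from the paper: the reward is an idealized majority vote (1 for the current most likely answer, 0 otherwise), the transform is $g(r) = \exp(r/\beta)$, which is the standard form of a KL-regularized solution, and the names `reweight`, `majority_reward`, and `pi0` are hypothetical.

```python
import math


def reweight(policy, reward, g):
    """One offline step: pi_m(y) is proportional to g(reward(y)) * pi_{m-1}(y)."""
    unnorm = {y: g(reward(y)) * p for y, p in policy.items()}
    z = sum(unnorm.values())
    return {y: w / z for y, w in unnorm.items()}


def majority_reward(policy):
    """Idealized majority-vote reward (assumption): 1 for the mode, 0 otherwise."""
    mode = max(policy, key=policy.get)
    return lambda y: 1.0 if y == mode else 0.0


beta = 1.0  # assumed regularization strength


def g_exp(r):
    """Assumed transform g(r) = exp(r / beta)."""
    return math.exp(r / beta)


# Toy base policy over three candidate final answers (made-up numbers).
pi0 = {"42": 0.5, "41": 0.3, "7": 0.2}

policy = dict(pi0)
history = [policy]
for m in range(3):
    # The reward is computed from the previous iterate, then applied offline.
    policy = reweight(policy, majority_reward(policy), g_exp)
    history.append(policy)

for m, p in enumerate(history):
    print(m, {y: round(v, 4) for y, v in p.items()})
```

Because the mode stays the same in every round of this toy run, the final policy equals the base policy reweighted by the product of the three transforms, which mirrors the product over $j$ in the formula above.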
Abstract
Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but relies on heavy online RL and incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies. Unlike TTRL, RoiRL eliminates the need to maintain a reference model and instead optimizes weighted log-likelihood objectives, enabling stable training with significantly lower memory and compute requirements. Experimental results show that RoiRL trains up to 2.5x faster and consistently outperforms TTRL on reasoning benchmarks, establishing a scalable path to self-improving LLMs without labels.
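The two ingredients named above, majority-vote rewards and a weighted log-likelihood objective, can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the toy answers and log-probabilities, and the identity transform `g(r) = r` (which simply keeps majority-consistent candidates) are all hypothetical.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Give reward 1.0 to candidates whose final answer matches the majority."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def weighted_nll(logprobs, rewards, g):
    """Negative weighted log-likelihood over a fixed, offline candidate set."""
    weights = [g(r) for r in rewards]
    return -sum(w * lp for w, lp in zip(weights, logprobs)) / len(logprobs)


# Final answers extracted from four sampled candidates for one prompt (toy data).
answers = ["12", "12", "15", "12"]
# Sequence log-probabilities of those candidates under the current model (toy data).
logprobs = [-2.0, -2.5, -1.0, -3.0]

rewards = majority_vote_rewards(answers)
loss = weighted_nll(logprobs, rewards, g=lambda r: r)  # identity transform (assumption)

print(rewards)  # [1.0, 1.0, 0.0, 1.0]
print(loss)     # 1.875
```

The loss depends only on the current model's log-probabilities and on weights computed once from the offline candidates, so no reference model is held in memory during the update, in line with the claim above.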
