RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

Zeng Zhiyuan, Jiashuo Liu, Zhangyue Yin, Ge Zhang, Wenhao Huang, Xipeng Qiu

TL;DR

RLoop identifies RLVR overfitting as a key challenge where training rewards rise but generalization stalls due to catastrophic forgetting and underutilization of inter-step policy diversity. It introduces an iterative, self-improving loop that alternates RL exploration from a base policy with a rejection-sampling fine-tuning (RFT) exploitation step, using a curated expert dataset generated entirely from its own trajectories. The approach yields substantial generalization gains, notably improving pass@k metrics and training stability while reducing forgetting, and it scales positively with more iterations. The results suggest that leveraging inter-step policy diversity through cyclical re-initialization can make RL-trained reasoning models more robust and generalizable in non-differentiable reward settings.

Abstract

While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the next iteration. This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust performance gains. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
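The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `rloop` and its callable parameters `rl_explore`, `is_correct`, and `rft_update` are hypothetical names, and the toy policy (a set of solved problem ids) only stands in for a real model. The sketch reads "refine the initial policy" as applying RFT to the policy that the iteration started from.

```python
from typing import Callable, FrozenSet, List, Tuple, TypeVar

Policy = TypeVar("Policy")
Trajectory = TypeVar("Trajectory")


def rloop(
    initial_policy: Policy,
    rl_explore: Callable[[Policy], List[Trajectory]],
    is_correct: Callable[[Trajectory], bool],
    rft_update: Callable[[Policy, List[Trajectory]], Policy],
    num_iterations: int,
) -> Policy:
    """Alternate RL exploration with an RFT exploitation step (hypothetical sketch)."""
    policy = initial_policy
    for _ in range(num_iterations):
        # Exploration: run RL from the current starting policy and
        # collect the trajectories produced along the way.
        trajectories = rl_explore(policy)
        # Rejection sampling: keep only trajectories the verifier accepts.
        expert_data = [t for t in trajectories if is_correct(t)]
        # Exploitation: fine-tune the iteration's starting policy on the
        # expert data; the result initializes the next iteration.
        if expert_data:
            policy = rft_update(policy, expert_data)
    return policy


# Toy stand-ins (assumptions for illustration only).
ToyPolicy = FrozenSet[int]
ToyTrajectory = Tuple[int, int]  # (problem id, proposed answer)


def toy_explore(policy: ToyPolicy) -> List[ToyTrajectory]:
    # Propose one correct and one incorrect answer for unseen problems.
    next_id = len(policy)
    return [(next_id, next_id * 2), (next_id + 1, -1)]


def toy_is_correct(trajectory: ToyTrajectory) -> bool:
    problem_id, answer = trajectory
    return answer == problem_id * 2


def toy_rft(policy: ToyPolicy, expert_data: List[ToyTrajectory]) -> ToyPolicy:
    # "Fine-tuning" here just records which problems now have verified solutions.
    return policy | {problem_id for problem_id, _ in expert_data}


final_policy = rloop(frozenset(), toy_explore, toy_is_correct, toy_rft, 3)
print(sorted(final_policy))  # [0, 1, 2]
```

In a real setting, `rl_explore` would wrap an RL run (the figures mention DAPO) and `rft_update` a supervised fine-tuning pass; both are left abstract here because the abstract does not specify them.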

Paper Structure

This paper contains 29 sections, 7 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: The reward, accuracy and pass@32 score of Qwen-2.5-math-7b trained with the DAPO algorithm evaluated on AIME-2024.
  • Figure 2: The value at ($i$, $j$) in the Learning Matrix represents the percentage of validation problems that the policy at step $j$ can solve but the policy at step $i$ cannot. Conversely, the value in the Forgetting Matrix represents problems solvable at step $i$ but not at step $j$. The value at ($i$, $j$) in the Similarity Matrix indicates the average n-gram similarity of trajectories between policies from step $i$ and step $j$. For all analyses, we sample 32 solutions for each question in the validation set.
  • Figure 3: The performance of Qwen-2.5-7b-Math trained with RLoop across different numbers of iterations, in terms of accuracy and pass@k score.
  • Figure 4: Comparison of the accuracy of vanilla RL and RLoop at different training steps.
  • Figure 5: Analysis of RLoop's mechanisms compared to vanilla RL. (a) Differential forgetting matrix (Vanilla RL Forgetting - RLoop Forgetting). Blue indicates RLoop forgets less. (b) N-gram similarity comparison, where lower values imply higher diversity. (c) Token-level policy entropy over training steps.
  • ...and 1 more figures
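
The Learning and Forgetting Matrices defined in the Figure 2 caption can be computed as in the sketch below. It is illustrative only: the function name `learning_and_forgetting` and the input layout `solved[step][problem]` are assumptions, and the caption does not say how "can solve" is decided from the 32 sampled solutions, so the boolean table is taken as given.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def learning_and_forgetting(
    solved: Sequence[Sequence[bool]],
) -> Tuple[Matrix, Matrix]:
    """Return (learning, forgetting) matrices as percentages of problems.

    solved[s][q] is True if the policy at step s can solve problem q
    (the solvability criterion itself is an assumption of the caller).
    """
    num_steps = len(solved)
    num_problems = len(solved[0])
    learning = [[0.0] * num_steps for _ in range(num_steps)]
    forgetting = [[0.0] * num_steps for _ in range(num_steps)]
    for i in range(num_steps):
        for j in range(num_steps):
            # Learning: solvable at step j but not at step i.
            gained = sum(
                1 for q in range(num_problems)
                if solved[j][q] and not solved[i][q]
            )
            # Forgetting: solvable at step i but not at step j.
            lost = sum(
                1 for q in range(num_problems)
                if solved[i][q] and not solved[j][q]
            )
            learning[i][j] = 100.0 * gained / num_problems
            forgetting[i][j] = 100.0 * lost / num_problems
    return learning, forgetting


# Three checkpoints evaluated on four validation problems (made-up data).
solved_table = [
    [True, False, False, False],
    [True, True, False, False],
    [False, True, True, False],
]
learning, forgetting = learning_and_forgetting(solved_table)
print(learning[0][2])    # 50.0: step 2 solves two problems that step 0 cannot
print(forgetting[0][2])  # 25.0: step 2 has lost one problem that step 0 solved
```

Under these definitions the Forgetting Matrix is the transpose of the Learning Matrix, so the two panels show the same information from opposite directions.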