iGRPO: Self-Feedback-Driven LLM Reasoning

Ali Hatamizadeh, Shrimai Prabhumoye, Igor Gitman, Ximing Lu, Seungju Han, Wei Ping, Yejin Choi, Jan Kautz

TL;DR

This work tackles verifiable multi-step mathematical reasoning in LLMs by introducing Iterative Group Relative Policy Optimization (iGRPO), a two-stage self-conditioning RL framework. Stage 1 samples multiple drafts and selects the highest-reward draft as feedback; Stage 2 conditions on this best draft and refines subsequent generations with a GRPO-style update, enabling bootstrapped improvement without altering the reward signal. Empirically, iGRPO yields consistent gains across model sizes (7B, 8B, 14B) and datasets, including state-of-the-art results on AIME24/AIME25 when trained on AceReason-Math, and its gains transfer to broader reasoning tasks and to stronger base models. Ablation studies show that the refinement wrapper benefits other group-based PPO variants, adapts to richer evaluators such as generative judges, and alters learning dynamics by delaying entropy collapse. The results indicate that simple, self-guided refinement can improve verifiable reasoning performance with only modest computational overhead, suggesting broad applicability to scalable LLM alignment for reasoning tasks.

Abstract

Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62% and 79.64% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
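
The two stages described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_fn` and `reward_fn` are hypothetical stand-ins for the policy sampler and the scalar reward, the prompt template in `build_refinement_prompt` is invented for illustration, and the advantage is normalized as (reward minus group mean) divided by group standard deviation, as in the commonly used GRPO formulation. The policy-gradient update itself is omitted.

```python
import random
import statistics
from typing import Any, Callable, Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def build_refinement_prompt(question: str, best_draft: str) -> str:
    # Hypothetical template: the exact wording used in the paper is not given here.
    return f"{question}\n\nPrevious attempt:\n{best_draft}\n\nImprove on the attempt above."


def igrpo_rollout(
    question: str,
    sample_fn: Callable[[str], str],
    reward_fn: Callable[[str], float],
    n_drafts: int,
    group_size: int,
) -> Dict[str, Any]:
    # Stage 1: sample exploratory drafts and keep the highest-reward one.
    drafts = [sample_fn(question) for _ in range(n_drafts)]
    draft_rewards = [reward_fn(d) for d in drafts]
    best_idx = max(range(n_drafts), key=draft_rewards.__getitem__)
    best_draft = drafts[best_idx]

    # Stage 2: condition on the best draft and score a group of refinements
    # with the same reward function used in Stage 1.
    prompt = build_refinement_prompt(question, best_draft)
    refinements = [sample_fn(prompt) for _ in range(group_size)]
    rewards = [reward_fn(o) for o in refinements]

    return {
        "best_draft": best_draft,
        "refinement_prompt": prompt,
        "refinements": refinements,
        "rewards": rewards,
        # These advantages would weight the GRPO-style policy update (not shown).
        "advantages": group_relative_advantages(rewards),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    toy_sampler = lambda prompt: rng.choice(["4", "5"])    # toy policy
    toy_reward = lambda out: 1.0 if out == "4" else 0.0    # binary verifier
    result = igrpo_rollout("What is 2 + 2?", toy_sampler, toy_reward, n_drafts=4, group_size=8)
    print(result["best_draft"], result["advantages"])
```

In this sketch only the Stage 2 refinements receive advantages. Stage 1 serves solely to pick the conditioning draft, which matches the abstract's description of a GRPO-style update on draft-conditioned refinements.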

Paper Structure

This paper contains 55 sections, 1 theorem, 48 equations, 5 figures, 7 tables, 1 algorithm.

Key Result

Proposition 3.1

Assume the reward is binary, $R_\phi(o) \in \{0,1\}$, and the Stage 1 drafts $\{d_i\}_{i=1}^N$ are sampled i.i.d. from $\pi_\theta(\cdot|q)$. Let $\mathbb{E}_{o \sim \pi_\theta(\cdot|q)}[R_\phi(o)]$ denote the expected reward under policy $\pi_\theta$, which equals the success probability $p_\theta(q) = \Pr[R_\phi(o)=1]$ in the binary case. Then the expected reward of the selected best draft $\hat{d}_\theta(q) = \arg\max_i R_\phi(d_i)$ satisfies

$$\mathbb{E}\big[R_\phi(\hat{d}_\theta(q))\big] = 1 - \big(1 - p_\theta(q)\big)^N,$$

which is monotonically increasing in $p_\theta(q)$.
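
With binary rewards and i.i.d. drafts, the best draft is correct exactly when at least one of the $N$ drafts is correct, so its expected reward is $1 - (1 - p_\theta(q))^N$. The short simulation below checks that closed form numerically. It is a sanity check under the proposition's assumptions rather than code from the paper, and the function names are illustrative.

```python
import random


def best_of_n_success_prob(p: float, n: int) -> float:
    """Closed form: probability that at least one of n i.i.d. drafts succeeds."""
    return 1.0 - (1.0 - p) ** n


def simulate_best_of_n(p: float, n: int, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected reward of the best of n binary drafts."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # The best draft has reward 1 if any single draft has reward 1.
        if any(rng.random() < p for _ in range(n)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for p in (0.1, 0.2, 0.5):
        print(p, best_of_n_success_prob(p, 4), simulate_best_of_n(p, 4))
```

The closed form is never below $p_\theta(q)$ and grows as the policy's success probability grows, which is the sense in which the conditioning signal improves together with the policy.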

Figures (5)

  • Figure 1: Iterative GRPO (iGRPO): During Exploratory Draft Generation, the model selects a high-scoring "best draft" from initial samples and appends it to the prompt for Conditioned Refinement. This augmented context guides the generation of new group-based updates, creating a bootstrapping effect where the policy continuously improves its own conditioning signal to enhance reasoning.
  • Figure 2: Pass@1 results for OpenReasoning-Nemotron-7B with and without iGRPO. Improvements appear not only on math but also on general reasoning tasks such as MMLU-Pro and GPQA.
  • Figure 3: Entropy dynamics. Per-token policy entropy during training. iGRPO maintains higher mid-training entropy than GRPO, indicating sustained exploration before convergence.
  • Figure S.1: Performance of iOpenMath-Nemotron-14B across various pass@N settings for AIME24 and AIME25. Both benchmarks exhibit increasing accuracy with higher $N$, though AIME24 quickly stabilizes at 93.33% by $N=16$, whereas AIME25 continues to rise until reaching 96.67% at $N=256$.
  • Figure S.2: Comparison of (a) average training rewards and (b) response lengths for GRPO vs. iGRPO.
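
The per-token policy entropy plotted in Figure 3 can be illustrated with a small, dependency-free sketch. How the paper computes and aggregates entropy is not stated here, so this assumes the Shannon entropy (in nats) of the next-token distribution, averaged over generated positions. The function names are illustrative.

```python
import math
from typing import Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (nats) of softmax(logits) for one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Skip zero-probability entries, whose contribution p * log(p) is 0 in the limit.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(logits_per_token: Sequence[Sequence[float]]) -> float:
    """Average entropy over all generated token positions."""
    return sum(token_entropy(l) for l in logits_per_token) / len(logits_per_token)


if __name__ == "__main__":
    uniform = [0.0, 0.0, 0.0, 0.0]   # maximally uncertain over 4 tokens
    peaked = [10.0, 0.0, 0.0, 0.0]   # nearly deterministic
    print(token_entropy(uniform), token_entropy(peaked))
    print(mean_token_entropy([uniform, peaked]))
```

A distribution that is uniform over the vocabulary attains the maximum entropy, while a nearly deterministic one approaches zero. Entropy collapse corresponds to the averaged value falling toward zero early in training.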

Theorems & Definitions (3)

  • Definition 1: Self-Conditioned Prompt Construction
  • Proposition 3.1: Progressive Conditioning Quality for Binary Rewards
  • proof