TIC-GRPO: Provable and Efficient Optimization for Reinforcement Learning from Human Feedback

Lei Pang, Jun Luo, Ruinan Jin

TL;DR

This paper proposes Trajectory-level Importance-Corrected GRPO (TIC-GRPO), a new algorithm that replaces token-level importance ratios with a single trajectory-level probability ratio, thereby yielding an estimate of the current policy gradient while preserving the critic-free structure.

Abstract

Group Relative Policy Optimization (GRPO), recently introduced by DeepSeek, is a critic-free reinforcement learning algorithm for fine-tuning large language models. GRPO replaces the value function in Proximal Policy Optimization (PPO) with group-normalized rewards while retaining PPO-style token-level importance sampling based on an old policy. Our theoretical analysis reveals that the GRPO update rule estimates the policy gradient at the old policy rather than the current one; however, since the old policy is refreshed every few steps, the resulting discrepancy remains small and the induced bias is negligible in practice. To empirically validate this insight, we conduct an ablation study that entirely removes importance sampling and performs multiple optimization steps using gradients estimated at a fixed old policy. Remarkably, this simplified variant attains performance comparable to standard GRPO. Motivated by this finding, we propose Trajectory-level Importance-Corrected GRPO (TIC-GRPO), a new algorithm that replaces token-level importance ratios with a single trajectory-level probability ratio, thereby yielding an estimate of the current policy gradient while preserving the critic-free structure. Furthermore, we present the first convergence analysis for GRPO-style methods and show that TIC-GRPO converges faster than GRPO. Finally, empirical results across math reasoning and coding tasks demonstrate the superiority of TIC-GRPO.
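The difference between the two kinds of importance ratio described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the epsilon term, and the toy log-probabilities are assumptions made for illustration, and any clipping or other details of the actual objectives are omitted.

```python
import math


def group_normalized_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (critic-free baseline).

    `eps` is an assumed numerical-stability constant, not taken from the paper.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def token_level_ratios(logp_new, logp_old):
    """One ratio per token: pi_new(token_t | prefix) / pi_old(token_t | prefix)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def trajectory_level_ratio(logp_new, logp_old):
    """A single ratio for the whole trajectory: P_new(trajectory) / P_old(trajectory).

    Summing per-token log-probabilities gives the trajectory log-probability.
    """
    return math.exp(sum(logp_new) - sum(logp_old))


# Toy per-token log-probabilities for one sampled response (hypothetical values).
logp_old = [-1.0, -2.0, -0.5]
logp_new = [-0.8, -2.1, -0.4]

per_token = token_level_ratios(logp_new, logp_old)
per_trajectory = trajectory_level_ratio(logp_new, logp_old)
advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])

print("token-level ratios:", per_token)
print("trajectory-level ratio:", per_trajectory)
print("group-normalized advantages:", advantages)
```

In this sketch the trajectory-level ratio equals the product of the token-level ratios, so every token in a response is weighted by the same scalar instead of by its own per-token ratio.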

Paper Structure

This paper contains 63 sections, 21 theorems, 223 equations, 4 figures, 2 tables.

Key Result

Theorem 4.1

(Convergence of GRPO) Assume that the conditions stated in Assumptions L_smooth are satisfied. Let $\theta_{1,0} \in \mathbb{R}^d$ denote an arbitrary initialization of the algorithm, and set $\eta=\frac{1}{\sqrt{N}\log|\mathcal{V}|}$. Then the sequence $\{ \theta_{n,k} \}$ generated by GRPO as de… Here $\mathcal{S}_T^{(n)}$ denotes the set of all trajectories sampled under $\pi_{\theta_{n,0}}$ a…

Figures (4)

  • Figure 1: Ablation study on importance sampling in GRPO using Qwen3-1.7B. Training reward curves show that removing importance sampling does not negatively impact performance.
  • Figure 2: Training dynamics of different GRPO variants on Qwen3-1.7B and Qwen3-8B models. Left panels show the AIME24 Avg@32 accuracy curves, while right panels report the corresponding training reward curves. TIC-GRPO consistently achieves faster convergence and higher final performance compared with GRPO and GSPO across both model scales.
  • Figure 3: Training dynamics of different GRPO variants on Qwen3-1.7B and Qwen3-8B models. Left panels show the AIME24 Avg@32 accuracy curves, while right panels report the corresponding training reward curves. TIC-GRPO consistently achieves faster convergence and higher final performance compared with GRPO and GSPO across both model scales.
  • Figure 4: Ablation analysis of TIC-GRPO on Qwen3-1.7B.

Theorems & Definitions (44)

  • Theorem 4.1: Convergence of GRPO
  • Theorem 4.2: Convergence of GRPO$_2$
  • Theorem 4.3: Convergence of TIC-GRPO
  • Lemma 1.1
  • proof
  • proof
  • proof
  • proof
  • Lemma 4.1
  • Lemma 4.2
  • ...and 34 more