
Reinforcement Learning: February 2026 Week 6

Feb 5 – Feb 11, 2026 · 120 papers analyzed · 3 breakthroughs

Summary

Week 6 (Feb 5-11): 3 breakthroughs from 120 papers. (1) 2602.10838 establishes rigorous convergence theory for mirror descent actor-critic in general Polish spaces with explicit bounds on TD steps needed; (2) 2602.09331 introduces counterfactual causal credit assignment that weights tokens by their measured impact on final answers without auxiliary models; (3) 2602.05717 (Anchored Policy Optimization) diagnoses Recursive Space Contraction as root cause of RLVR collapse and provides support-constrained fix. GRPO stabilization methods proliferate (EBPO, iGRPO, CGRPO, FGO); credit assignment emerges as central theme.

Key Takeaway

Credit assignment emerges as the fundamental challenge: GRPO variants proliferate to patch individual instabilities, while new methods attack the root cause itself, uniform token weighting.

Breakthroughs (3)

1. Mirror descent actor-critic methods for entropy regularised MDPs in general spaces: stability and convergence

Why Novel: Prior actor-critic convergence results either assume exact advantage computation or finite state-action spaces. This work handles both continuous spaces and the approximation error from TD learning, providing explicit bounds on how many TD steps are needed per policy update.

Impact: Provides theoretical foundation for practical actor-critic implementations in continuous domains where exact value computation is impossible.
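The general mirror-descent actor-critic template the paper analyzes can be illustrated in a toy tabular, entropy-regularized MDP: an inner loop of m TD critic steps approximates the soft Q-values, then a KL mirror-descent (multiplicative-weights) step updates the policy. All sizes, step sizes, and the choice of m below are illustrative placeholders, not the paper's bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy entropy-regularized MDP (sizes and constants are illustrative)
S, A, gamma, tau = 4, 3, 0.9, 0.1            # states, actions, discount, entropy temp
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.uniform(0, 1, size=(S, A))           # reward table

pi = np.full((S, A), 1.0 / A)                # uniform initial policy
Q = np.zeros((S, A))

eta, m, alpha = 0.5, 200, 0.1                # MD step size, TD steps per update, TD lr

for k in range(50):
    # --- critic: m TD(0) steps toward the soft Q-values of the current policy ---
    for _ in range(m):
        s, a = rng.integers(S), rng.integers(A)
        s2 = rng.choice(S, p=P[s, a])
        # soft value: V(s') = E_pi[Q(s', .)] + tau * entropy(pi(. | s'))
        v2 = pi[s2] @ Q[s2] - tau * (pi[s2] @ np.log(pi[s2] + 1e-12))
        Q[s, a] += alpha * (R[s, a] + gamma * v2 - Q[s, a])
    # --- actor: KL mirror-descent update, i.e. pi_new ∝ pi^(1 - eta*tau) * exp(eta*Q) ---
    logits = (1 - eta * tau) * np.log(pi + 1e-12) + eta * Q
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
```

The point the convergence theory makes precise is how large m must be (as a function of the approximation error one tolerates) for the outer mirror-descent iteration to remain stable; the fixed m above is just a stand-in for that bound.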

2. Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization

Why Novel: GRPO/DAPO assign uniform credit across all tokens despite wildly varying importance. Prior solutions require auxiliary models or external annotation. This method extracts importance directly from the policy itself via counterfactual evaluation.

Impact: Addresses the fundamental flaw in GRPO-style methods where filler phrases receive same gradient as critical calculations.
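The uniform-credit flaw is visible directly in how GRPO forms its loss: one group-normalized, sequence-level advantage multiplies every token's log-probability. The sketch below contrasts that with a counterfactual-style token weighting; the weight values are placeholders, not the paper's measured importances.

```python
import numpy as np

# A group of G sampled responses with scalar rewards (GRPO-style)
rewards = np.array([1.0, 0.0, 0.0, 1.0])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-normalized advantage

# Token log-probs of the first response (toy values)
logp = np.array([-0.2, -1.5, -0.1, -2.3, -0.4])

# Uniform credit (GRPO/DAPO): the same sequence-level advantage scales every token,
# so a filler phrase gets the same gradient as a critical calculation
loss_uniform = -(adv[0] * logp).sum()

# Hypothetical counterfactual weighting: upweight tokens whose perturbation changes
# the final-answer probability most (these weights are illustrative placeholders)
w = np.array([0.1, 0.9, 0.05, 1.0, 0.2])
loss_weighted = -(adv[0] * w * logp).sum()
```

The gradient of `loss_weighted` concentrates on high-impact tokens, which is the behavior the counterfactual method targets, and it does so without an auxiliary reward or value model since the weights come from evaluating the policy itself.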

3. Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification

Why Novel: Prior RLVR collapse diagnoses focused on symptoms (vanishing gradients, mode collapse). This work traces collapse to the interaction between positive- and negative-example dynamics, showing why KL regularization's shape-matching constraint is fundamentally insufficient.

Impact: Provides mechanistic understanding of RLVR collapse beyond prior symptom-level analysis, with constructive solution that preserves exploration.

Trends

  • Credit assignment becomes central: counterfactual importance, step-level rewards, fine-grained weighting all address the uniform credit flaw in GRPO.

  • GRPO stabilization methods multiply: EBPO (Bayes shrinkage), iGRPO (self-feedback), CGRPO (constraints), FGO (fine-grained) — each patches a different instability mode.

  • Actor-critic theory matures: first rigorous convergence in general spaces fills critical gap between theory and practice.

  • RLVR collapse mechanisms clarified: RSC diagnosis more precise than prior 'sharpening-to-collapse' descriptions.

  • Multi-agent MARL for LLMs: Dr. MAS provides first theoretical analysis of extending GRPO to agent populations.

Notable Papers (8)

1. EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization

James-Stein shrinkage on GRPO advantage estimates to combat high variance under small group sizes.
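A minimal sketch of the idea, using the classic positive-part James-Stein estimator: noisy per-group baseline estimates are shrunk toward the grand mean, with the shrinkage growing as the within-group noise grows. Function name, the noise-variance argument, and the toy numbers are illustrative; this is not EBPO's exact estimator.

```python
import numpy as np

def js_shrink(group_means, noise_var):
    """Positive-part James-Stein shrinkage of per-group mean-reward
    estimates toward the grand mean. A sketch of the empirical-Bayes
    idea behind EBPO, not the paper's exact formula."""
    m = np.asarray(group_means, dtype=float)
    k = m.size                                    # classic formula needs k >= 4
    grand = m.mean()
    ss = ((m - grand) ** 2).sum()
    c = max(0.0, 1.0 - (k - 3) * noise_var / ss)  # shrinkage factor in [0, 1]
    return grand + c * (m - grand)

# Noisy per-prompt baselines estimated from small groups (toy numbers)
means = [0.9, 0.1, 0.55, 0.4, 0.7]
shrunk = js_shrink(means, noise_var=0.05)
```

With small group sizes each raw mean is high-variance, and shrinking toward the pooled mean trades a little bias for a large variance reduction, which is exactly the regime GRPO's group-relative advantages operate in.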

2. iGRPO: Self-Feedback-Driven LLM Reasoning

Model generates own feedback for iterative GRPO without external reward models.

3. Constrained Group Relative Policy Optimization

Lagrangian extension of GRPO for explicit behavioral constraints with safety indicator costs.
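The Lagrangian pattern here is standard constrained RL: shape the reward with a multiplier-weighted cost, and update the multiplier by dual ascent on the constraint violation. The sketch below shows one such step with GRPO-style advantage normalization; function name, learning rate, and budget are illustrative assumptions, not the paper's exact update.

```python
import numpy as np

def lagrangian_step(rewards, costs, lam, budget, lam_lr=0.1):
    """One dual-ascent step of a Lagrangian-constrained objective:
    advantages use reward - lam * cost, and lam grows whenever the
    expected safety-indicator cost exceeds its budget. Illustrative
    sketch, not the paper's exact method."""
    shaped = np.asarray(rewards, dtype=float) - lam * np.asarray(costs, dtype=float)
    adv = (shaped - shaped.mean()) / (shaped.std() + 1e-8)   # GRPO-style normalization
    lam = max(0.0, lam + lam_lr * (np.mean(costs) - budget))  # dual ascent, lam >= 0
    return adv, lam

rewards = [1.0, 0.2, 0.8, 0.0]
costs = [1, 0, 1, 0]   # binary safety-violation indicators per response
adv, lam = lagrangian_step(rewards, costs, lam=0.5, budget=0.1)
```

Because the cost here exceeds the budget (mean cost 0.5 vs. budget 0.1), the multiplier increases, making violations more expensive in the next round of advantage computation.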

4. Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization

FGO refines group responses via length and entropy weighting for CoT compression.

5. Diffusion-State Policy Optimization for Masked Diffusion Language Models

DiSPO provides intermediate credit assignment for masked diffusion via branched resampling.

6. Distributional Reinforcement Learning with Diffusion Bridge Critics

Diffusion bridge for full return distribution modeling in continuous control.

7. Displacement-Resistant Extensions of DPO with Nonconvex f-Divergences

Proves tractability extends beyond convex f-divergences; nonconvex choices resist displacement.

8. Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

Agent-specific baselines fix gradient instability when extending GRPO to multi-agent LLM systems.
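The agent-specific-baseline idea can be sketched by normalizing each agent's rewards against that agent's own group statistics rather than one baseline pooled across heterogeneous agents. Names, agent labels, and numbers below are illustrative; this is a reading of the abstract, not Dr. MAS's exact math.

```python
import numpy as np

def per_agent_advantages(rewards, agent_ids):
    """Group-relative advantages computed per agent: each reward is
    normalized by the mean/std of rollouts from the *same* agent,
    so agents with different reward scales don't distort each other's
    gradients. Sketch of the agent-specific-baseline idea."""
    rewards = np.asarray(rewards, dtype=float)
    agent_ids = np.asarray(agent_ids)
    adv = np.empty_like(rewards)
    for aid in np.unique(agent_ids):
        mask = agent_ids == aid
        r = rewards[mask]
        adv[mask] = (r - r.mean()) / (r.std() + 1e-8)
    return adv

# Two agents with very different reward scales (toy numbers): a pooled
# baseline would give the "planner" huge advantages and the "coder" none
adv = per_agent_advantages([10.0, 12.0, 0.1, 0.3],
                           ["planner", "planner", "coder", "coder"])
```

After per-agent normalization both agents receive advantages on the same scale, which is the gradient-stability property the paper attributes to agent-specific baselines.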

Honorable Mentions

  • Variance Reduction Based Experience Replay for Policy Optimization
  • Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization
  • Difficulty-Estimated Policy Optimization
  • Length-Unbiased Sequence Policy Optimization
  • Trust Regions Sell, But Who's Buying? Overlap Geometry as Alternative Trust Region
  • AceGRPO: Adaptive Curriculum Enhanced GRPO for Autonomous ML Engineering