
Reinforcement Learning: January 2026 Week 2

Jan 8 – Jan 14, 2026 · 74 papers analyzed · 3 breakthroughs

Summary

Week 2 (Jan 8-14): 3 breakthroughs from 74 papers. (1) 2601.05242 (GDPO) decouples group rewards to prevent signal collapse in multi-reward RL; (2) 2601.05870 (IIB-LPO) tackles exploration collapse in RLVR via latent branching; (3) 2601.06336 (Foresight Learning) extends RLVR to real-world temporal prediction. The proliferation of GRPO variants signals limitations in the base method.

Key Takeaway

GRPO fragmentation begins; RLVR exploration issues identified but not yet diagnosed at root cause.

Breakthroughs (3)

1. GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Why Novel: Shows that naively applying GRPO to aggregated rewards collapses distinct reward signals. Introduces group-decoupled normalization to preserve multi-reward diversity.

Key Innovations:

  • Identifies reward signal collapse in standard GRPO with multiple rewards
  • Group-decoupled normalization preserves individual reward gradients
  • Enables alignment to diverse human preferences simultaneously
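
To make the normalization difference concrete, here is a minimal sketch in Python, assuming K scalar rewards per completion in a group of G samples; the function names, the toy reward values, and the choice to return a per-reward advantage matrix are illustrative assumptions, not details from the paper.

    import numpy as np

    def grpo_advantages_aggregated(rewards):
        # Naive GRPO with multiple rewards: sum the components per sample,
        # then normalize the aggregate across the group. Opposing signals
        # can cancel, flattening the group statistics to zero advantage.
        total = rewards.sum(axis=1)                        # shape (G,)
        return (total - total.mean()) / (total.std() + 1e-8)

    def gdpo_advantages_decoupled(rewards):
        # Decoupled sketch: normalize each reward component within the group
        # first, so every signal keeps its own scale and gradient direction.
        # How the K columns are combined downstream is an assumption here;
        # the paper's exact rule may differ.
        mean = rewards.mean(axis=0, keepdims=True)         # shape (1, K)
        std = rewards.std(axis=0, keepdims=True) + 1e-8    # shape (1, K)
        return (rewards - mean) / std                      # shape (G, K)

    # Toy group: G=4 completions scored by K=2 rewards (e.g. helpfulness, brevity).
    rewards = np.array([[0.9, 0.1],
                        [0.1, 0.9],
                        [0.5, 0.5],
                        [0.5, 0.5]])
    print(grpo_advantages_aggregated(rewards))  # all zeros: the aggregate collapses
    print(gdpo_advantages_decoupled(rewards))   # nonzero per-reward advantages

The toy case is the collapse scenario the paper identifies: two rewards that trade off against each other produce identical aggregate scores, so standard group normalization sees no variance at all, while per-reward normalization still separates the completions.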

Evidence:

  • Analysis of reward collapse in GRPO
  • GDPO algorithm with decoupled normalization
  • Multi-reward benchmark improvements

Impact: Fixes a fundamental flaw in GRPO for multi-objective alignment.

2. IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck

Why Novel: Shifts RLVR exploration from token-level distribution perturbations to topological diversification of reasoning trajectories. First to use an information bottleneck for reasoning diversity.

Key Innovations:

  • Triggers latent branching at high-entropy states instead of token-level perturbation
  • Iterative information bottleneck identifies reasoning branch points
  • Prevents exploration collapse in long-horizon reasoning
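
As a rough illustration of the branching trigger, the sketch below flags high-entropy decoding steps along one sampled trajectory and seeds alternative continuations from them; the entropy threshold, the branch count, and the omission of the information-bottleneck objective itself are simplifying assumptions, not the paper's algorithm.

    import torch
    import torch.nn.functional as F

    def branch_points(logits, entropy_threshold=2.0):
        # logits: (T, V) per-step logits along one sampled reasoning trajectory.
        # Flag the steps whose next-token entropy exceeds the threshold.
        log_probs = F.log_softmax(logits, dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # shape (T,)
        return (entropy > entropy_threshold).nonzero(as_tuple=True)[0]

    def seed_branches(prefix_ids, logits, n_branches=4):
        # At each flagged high-entropy step, sample alternative next tokens and
        # return (shared prefix, alternatives) pairs for new rollouts, rather
        # than perturbing the token distribution at every position.
        branches = []
        for t in branch_points(logits):
            probs = F.softmax(logits[t], dim=-1)
            alternatives = torch.multinomial(probs, n_branches)
            branches.append((prefix_ids[: int(t) + 1], alternatives))
        return branches

Each (prefix, alternative-token) pair would then be continued by the usual rollout and scored with the verifiable reward; only the decision of where to diversify differs from standard RLVR sampling.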

Evidence:

  • Information bottleneck formulation for branching
  • Latent policy optimization algorithm
  • Reasoning trajectory diversity visualization

Impact: Provides principled solution to exploration collapse that plagues RLVR methods.

3. Future-as-Label: Scalable Supervision from Real-World Outcomes

Why Novel: Extends RLVR paradigm beyond synthetic verifiers to real-world temporal prediction tasks. Predictions at time t are verified by outcomes revealed after t.

Key Innovations:

  • Foresight Learning: use future outcomes as automatic labels
  • Temporally causal: predictions use only past information
  • Scalable to domains without synthetic verification
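
A minimal sketch of the future-as-label construction, assuming a simple timestamped event stream and a tolerance-based outcome reward; the schema, the seven-day horizon, and the reward shape are placeholders, not the paper's setup.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Event:
        timestamp: datetime
        features: dict      # information available at this time
        outcome: float      # value revealed only after the prediction horizon

    def build_foresight_pairs(events, horizon_days=7):
        # Pair each prediction time t with the first outcome revealed at least
        # horizon_days later. Temporal causality: the input never includes
        # information dated after t.
        pairs = []
        for i, ev in enumerate(events):
            for later in events[i + 1:]:
                if (later.timestamp - ev.timestamp).days >= horizon_days:
                    pairs.append((ev.features, later.outcome))
                    break
        return pairs

    def outcome_reward(prediction, realized, tolerance=0.05):
        # Verifiable reward without a synthetic checker: score the model's
        # prediction against the outcome the world actually produced.
        return 1.0 if abs(prediction - realized) <= tolerance else 0.0

The realized outcome stands in for the synthetic verifier of standard RLVR, which is what makes the supervision scale to domains that lack one.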

Evidence:

  • Foresight Learning formulation
  • Real-world prediction experiments

Impact: Opens RLVR to domains where synthetic verification is impossible.

Trends

  • GRPO variants are proliferating (GDPO, GDEPO), signaling that the base method has limitations

  • RLVR exploration collapse being actively addressed

  • Multi-reward and multi-objective alignment gaining attention

Notable Papers (5)

1. GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization

Fixes GRPO inefficiencies for automated theorem proving via dual-dynamic optimization.

2. On-the-Fly VLA Adaptation via Test-Time Reinforcement Learning

Progress-based reward + value-free PPO for VLA test-time adaptation.

3. Reward-Preserving Attacks For Robust Reinforcement Learning

Adversarial attacks that preserve reward signal to test RL robustness.

4. Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR

Hybrid token-sequence policy for RLVR flexibility.

5. Chaining the Evidence: Robust RL for Deep Search Agents with Citation-Aware Rubric Rewards

Citation-aware rewards for research agent training.

Honorable Mentions

  • AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search
  • SketchVL: Policy Optimization via Fine-Grained Credit Assignment