Reinforcement Learning: January 2026 Week 2
Jan 8 – Jan 14, 2026 · 74 papers analyzed · 3 breakthroughs
Summary
Week 2 (Jan 8-14): 3 breakthroughs from 74 papers. (1) 2601.05242 (GDPO) decouples group rewards to prevent signal collapse in multi-reward RL; (2) 2601.05870 (IIB-LPO) tackles exploration collapse in RLVR via latent branching; (3) 2601.06336 (Foresight Learning) extends RLVR to real-world temporal prediction. The proliferation of GRPO variants signals limitations in the base method.
Key Takeaway
GRPO fragmentation begins; RLVR exploration issues identified but not yet diagnosed at root cause.
Breakthroughs (3)
1. GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Why Novel: Shows that naively applying GRPO to aggregated rewards collapses distinct reward signals. Introduces group-decoupled normalization to preserve multi-reward diversity.
Key Innovations:
- Identifies reward signal collapse in standard GRPO with multiple rewards
- Group-decoupled normalization preserves individual reward gradients
- Enables alignment to diverse human preferences simultaneously
Evidence:
- Analysis of reward collapse in GRPO
- GDPO algorithm with decoupled normalization
- Multi-reward benchmark improvements
Impact: Fixes a fundamental flaw in GRPO for multi-objective alignment.
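A minimal sketch of the decoupling idea described above, not the paper's exact update rule: normalize each reward channel within its sampled group separately and then combine the per-channel advantages, instead of normalizing an already-aggregated reward, where a high-variance channel can dominate the group statistics. The function names and the equal-weight combination are assumptions.

```python
import numpy as np

def grpo_advantages_aggregated(rewards, eps=1e-8):
    # rewards: (group_size, num_rewards) for one prompt's sampled completions.
    # Standard GRPO-style baseline: sum the reward channels first, then
    # normalize the aggregate within the group; a high-variance channel can
    # dominate the statistics and collapse the other signals.
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + eps)

def gdpo_advantages_decoupled(rewards, weights=None, eps=1e-8):
    # Sketch of group reward-decoupled normalization: normalize each reward
    # channel within the group on its own, so every channel keeps a comparable
    # gradient scale, then combine the per-channel advantages.
    num_rewards = rewards.shape[1]
    weights = np.full(num_rewards, 1.0 / num_rewards) if weights is None else weights
    per_channel = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    return per_channel @ weights

# Toy example: 4 sampled completions, 2 reward channels (e.g. helpfulness, format).
rewards = np.array([[0.9, 1.0],
                    [0.8, 0.0],
                    [0.1, 1.0],
                    [0.2, 0.0]])
print(grpo_advantages_aggregated(rewards))
print(gdpo_advantages_decoupled(rewards))
```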
2. IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck
Why Novel: Shifts RLVR exploration from token distribution perturbations to topological diversification of reasoning trajectories. First to use information bottleneck for reasoning diversity.
Key Innovations:
- Triggers latent branching at high-entropy states instead of token-level perturbation
- Iterative information bottleneck identifies reasoning branch points
- Prevents exploration collapse in long-horizon reasoning
Evidence:
- Information bottleneck formulation for branching
- Latent policy optimization algorithm
- Reasoning trajectory diversity visualization
Impact: Provides principled solution to exploration collapse that plagues RLVR methods.
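The sketch below is a hedged illustration of the entropy-triggered branching idea, not the paper's algorithm or its information-bottleneck objective: decode along a set of trajectories and, at states where next-token entropy is high, fork additional sampled continuations so exploration diversifies at reasoning branch points rather than through token-level perturbation everywhere. The `policy_logits` stub, threshold, and branch caps are assumptions.

```python
import numpy as np

def next_token_entropy(logits, eps=1e-12):
    # Shannon entropy (nats) of the softmax distribution over next tokens.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(-(probs * np.log(probs + eps)).sum()), probs

def rollout_with_latent_branching(policy_logits, prefix, max_steps=32,
                                  entropy_threshold=2.5, branch_factor=2,
                                  max_branches=8, rng=None):
    # Sketch (an assumption, not the paper's exact procedure).
    # `policy_logits(token_ids) -> np.ndarray of vocab logits` is a hypothetical stub.
    rng = np.random.default_rng() if rng is None else rng
    frontier = [list(prefix)]
    for _ in range(max_steps):
        next_frontier = []
        for traj in frontier:
            ent, probs = next_token_entropy(policy_logits(traj))
            if ent > entropy_threshold and len(frontier) < max_branches:
                # High-uncertainty state: branch into several sampled continuations.
                next_tokens = rng.choice(len(probs), size=branch_factor, p=probs)
            else:
                # Low-uncertainty state: follow the greedy token.
                next_tokens = [probs.argmax()]
            next_frontier.extend(traj + [int(t)] for t in next_tokens)
        frontier = next_frontier
    return frontier

# Usage with a toy random "policy" over a 16-token vocabulary.
toy_policy = lambda token_ids: np.random.default_rng(len(token_ids)).normal(size=16)
trajectories = rollout_with_latent_branching(toy_policy, prefix=[0], max_steps=8)
print(len(trajectories), "trajectories explored")
```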
3. Future-as-Label: Scalable Supervision from Real-World Outcomes
Why Novel: Extends the RLVR paradigm beyond synthetic verifiers to real-world temporal prediction tasks. Predictions made at time t are verified by outcomes revealed after t.
Key Innovations:
- Foresight Learning: use future outcomes as automatic labels
- Temporally causal: predictions use only past information
- Scalable to domains without synthetic verification
Evidence:
- Foresight Learning formulation
- Real-world prediction experiments
Impact: Opens RLVR to domains where synthetic verification is impossible.
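A minimal sketch of how future-as-label supervision could be wired up, under the two constraints stated above: the prediction uses only information available before the outcome (temporal causality), and the reward is computed only once the outcome is revealed. The record fields, the binary outcome, and the Brier-style reward are illustrative assumptions, not the paper's setup.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class PredictionRecord:
    # One prediction made at `asked_at`, graded once the real-world outcome
    # becomes observable. Field names are illustrative assumptions.
    context: str                    # information available strictly before asked_at
    question: str                   # e.g. "Will X happen by 2026-02-01?"
    asked_at: datetime
    predicted_prob: float           # model's probability for the "yes" outcome
    resolves_at: datetime           # earliest time the outcome can be observed
    outcome: Optional[bool] = None  # filled in only after resolves_at

def foresight_reward(record: PredictionRecord, now: datetime) -> Optional[float]:
    # Verifiable reward from a revealed outcome (sketch): a Brier-style score
    # in [-1, 0]. Before resolution the example yields no training signal,
    # which preserves temporal causality.
    if now < record.resolves_at or record.outcome is None:
        return None  # outcome not yet revealed; cannot verify
    target = 1.0 if record.outcome else 0.0
    return -(record.predicted_prob - target) ** 2

# Usage: only resolved records contribute to the RLVR-style update.
records = [
    PredictionRecord(
        context="news snapshot as of 2026-01-05",
        question="Will the release ship by 2026-01-10?",
        asked_at=datetime(2026, 1, 5),
        predicted_prob=0.7,
        resolves_at=datetime(2026, 1, 10),
        outcome=True,
    )
]
now = datetime(2026, 1, 14)
graded = ((r, foresight_reward(r, now)) for r in records)
training_batch = [(r, reward) for r, reward in graded if reward is not None]
print(training_batch)
```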
Trends
GRPO variants proliferating (GDPO, GDEPO), signaling limitations in the base method
RLVR exploration collapse being actively addressed
Multi-reward and multi-objective alignment gaining attention
Notable Papers (5)
1. GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization
Fixes GRPO inefficiencies for automated theorem proving via dual-dynamic optimization.
2. On-the-Fly VLA Adaptation via Test-Time Reinforcement Learning
Progress-based reward + value-free PPO for VLA test-time adaptation.
3. Reward-Preserving Attacks For Robust Reinforcement Learning
Adversarial attacks that preserve reward signal to test RL robustness.
4. Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR
Hybrid token-sequence policy for RLVR flexibility.
5. Chaining the Evidence: Robust RL for Deep Search Agents with Citation-Aware Rubric Rewards
Citation-aware rewards for research agent training.
Honorable Mentions
- AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search
- SketchVL: Policy Optimization via Fine-Grained Credit Assignment