Reinforcement Learning: January 2026 Week 3
Jan 15 – Jan 21, 2026 · 78 papers analyzed · 3 breakthroughs
Summary
Week 3 (Jan 15-21): 3 breakthroughs from 78 papers. (1) 2601.11061 proves RLVR activates memorization shortcuts, not reasoning — fundamental critique; (2) 2601.10471 (DeFlow) decouples manifold modeling from value maximization in offline RL; (3) 2601.12008 (EVO) applies Extreme Value Theory for tail-risk safe RL. RLVR critique papers emerge; offline RL gets principled.
Key Takeaway
RLVR's fundamental assumptions challenged; the field starts asking 'does this actually work?'
Breakthroughs (3)
1. Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs
Why Novel: First mechanistic analysis proving RLVR improves benchmark performance via memorization shortcuts, not genuine reasoning. Uses Path Patching to trace activation patterns.
Key Innovations:
- Shows RLVR with spurious rewards still improves contaminated benchmarks
- Path Patching reveals memorization circuit activation, not reasoning
- Challenges the core assumption that RLVR teaches reasoning
Evidence:
- Spurious reward experimental setup
- Path Patching mechanistic analysis (sketched below)
- Activation patterns showing memorization circuits
Impact: A fundamental challenge to the RLVR paradigm; performance gains may not reflect capability gains.
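For readers unfamiliar with the technique, the sketch below illustrates the core activation-patching move behind Path Patching: cache an internal activation from one input, splice it into a forward pass on another input, and measure how much the output shifts. The toy MLP, the patch site, and the random inputs are hypothetical stand-ins; the paper's analysis operates on attention and MLP paths inside an LLM, and this is not its exact protocol.

```python
# Minimal activation-patching sketch (the core move behind Path Patching).
# Toy 3-layer MLP and random inputs are hypothetical stand-ins; the paper's
# analysis targets attention/MLP paths inside an LLM. Requires only PyTorch.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),   # model[2] will serve as the patch site
    nn.Linear(32, 4),               # logits over 4 answer options
)

x_memorized = torch.randn(1, 16)    # stands in for a contaminated benchmark item
x_heldout = torch.randn(1, 16)      # stands in for a genuinely novel item

def run_with_patch(model, x, site=None, patch_value=None):
    """Run the model on x; optionally cache or overwrite one module's output."""
    cache = {}
    def save_hook(mod, inp, out):
        cache["act"] = out.detach()
    def patch_hook(mod, inp, out):
        return patch_value            # returning a tensor replaces the output
    handle = None
    if site is not None:
        hook = patch_hook if patch_value is not None else save_hook
        handle = site.register_forward_hook(hook)
    logits = model(x)
    if handle is not None:
        handle.remove()
    return logits, cache.get("act")

site = model[2]  # hypothetical location of the "memorization circuit"

# 1) Cache the activation produced by the memorized item at the chosen site.
_, memorized_act = run_with_patch(model, x_memorized, site=site)

# 2) Baseline logits on the held-out item.
baseline_logits, _ = run_with_patch(model, x_heldout)

# 3) Splice the memorized activation into the held-out forward pass.
patched_logits, _ = run_with_patch(model, x_heldout, site=site,
                                   patch_value=memorized_act)

# A large shift traceable to a single site is the kind of localized evidence
# used to argue that a specific circuit, not general reasoning, drives the gain.
print("max logit shift:", (patched_logits - baseline_logits).abs().max().item())
```

Full path patching additionally restricts which downstream paths receive the patched activation; the single-hook version above does not attempt that and is only meant to convey the idea.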
2. DeFlow: Decoupling Manifold Modeling and Value Maximization for Offline Policy Extraction
Why Novel: Resolves the expressivity-optimization dilemma in offline RL by separating behavior modeling from value refinement. First to use multi-step flow matching to model the behavior manifold.
Key Innovations:
- Multi-step flow matching captures behavior manifold faithfully
- Lightweight instance-aware refinement for value-based improvement
- Avoids mode collapse from jointly optimizing both objectives
Evidence:
- Flow matching for behavior modeling
- Decoupled refinement algorithm (sketched below)
- D4RL benchmark results
Impact: Provides clean separation of concerns for offline RL, improving both expressivity and optimization.
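The decoupling can be pictured as a two-stage sketch: fit an expressive behavior model with conditional flow matching on the offline data alone, then apply a small value-guided adjustment to sampled actions at extraction time. Dimensions, the gradient-based refinement rule, and the stand-in critic below are illustrative assumptions, not the published DeFlow algorithm.

```python
# Two-stage sketch of decoupled offline policy extraction: (1) behavior
# modeling with conditional flow matching, (2) a separate, lightweight
# value-guided refinement of sampled actions. Dimensions, the refinement rule,
# and the stand-in critic are hypothetical; this is not the published DeFlow.
import torch
import torch.nn as nn

S_DIM, A_DIM = 8, 3

class VelocityField(nn.Module):
    """v_theta(a_t, t | s): velocity network trained with flow matching."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(S_DIM + A_DIM + 1, 128), nn.ReLU(),
            nn.Linear(128, A_DIM),
        )
    def forward(self, s, a_t, t):
        return self.net(torch.cat([s, a_t, t], dim=-1))

v_theta = VelocityField()
opt = torch.optim.Adam(v_theta.parameters(), lr=3e-4)

def flow_matching_loss(s, a_data):
    """Stage 1: regress onto the straight-line velocity a_data - a0."""
    a0 = torch.randn_like(a_data)              # noise endpoint
    t = torch.rand(a_data.shape[0], 1)
    a_t = (1 - t) * a0 + t * a_data            # interpolated point on the path
    return ((v_theta(s, a_t, t) - (a_data - a0)) ** 2).mean()

@torch.no_grad()
def sample_behavior_action(s, steps=8):
    """Multi-step Euler integration of the learned flow: noise -> action."""
    a = torch.randn(s.shape[0], A_DIM)
    for i in range(steps):
        t = torch.full((s.shape[0], 1), i / steps)
        a = a + v_theta(s, a, t) / steps
    return a

def refine_with_value(s, a, q_fn, lr=0.05, k=3):
    """Stage 2: nudge sampled actions up the critic's gradient. The flow
    parameters are never touched, so the behavior manifold stays intact."""
    a = a.clone().requires_grad_(True)
    for _ in range(k):
        (grad,) = torch.autograd.grad(q_fn(s, a).sum(), a)
        a = (a + lr * grad).detach().requires_grad_(True)
    return a.detach()

# Illustrative usage on fake data (real training loops over the offline set).
s_batch, a_batch = torch.randn(64, S_DIM), torch.randn(64, A_DIM)
loss = flow_matching_loss(s_batch, a_batch)
opt.zero_grad(); loss.backward(); opt.step()

q_fn = lambda s, a: -(a ** 2).sum(dim=-1, keepdim=True)   # stand-in critic
actions = refine_with_value(s_batch, sample_behavior_action(s_batch), q_fn)
```

Keeping the critic out of the flow-matching objective is what prevents the mode collapse mentioned above: the flow stays faithful to the data distribution, and value maximization only nudges individual samples.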
3. Extreme Value Policy Optimization for Safe Reinforcement Learning
Why Novel: First to integrate Extreme Value Theory into constrained RL for tail-risk safety. A GPD-based constraint captures worst-case outcomes, not just the average.
Key Innovations:
- Generalized Pareto Distribution fitted to tail samples for extreme quantile constraint
- Tail-aware policy optimization avoids catastrophic but rare failures
- Principled handling of constraint violations in safety-critical domains
Evidence:
- EVT integration into constrained RL
- EVO algorithm with GPD constraint (sketched below)
- Safety benchmark with rare failure scenarios
Impact: Addresses the 'long tail' safety problem that average-case constraints miss.
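A minimal sketch of the extreme-value ingredient, assuming SciPy: fit a Generalized Pareto Distribution to cost exceedances over a high threshold (peaks over threshold) and drive a Lagrangian-style dual variable with the implied extreme quantile rather than the mean cost. The threshold fraction, target quantile, and dual update rule are illustrative choices, not EVO's exact formulation.

```python
# Sketch of the extreme-value ingredient, assuming SciPy: fit a Generalized
# Pareto Distribution to cost exceedances over a high threshold and constrain
# an extreme quantile instead of the mean. Threshold, target quantile, and the
# dual update are illustrative choices, not EVO's exact formulation.
import numpy as np
from scipy.stats import genpareto

def extreme_quantile(costs, tail_frac=0.10, q=0.999):
    """Peaks-over-threshold estimate of the q-quantile of episode cost."""
    costs = np.asarray(costs)
    u = np.quantile(costs, 1.0 - tail_frac)       # POT threshold
    exceedances = costs[costs > u] - u
    if len(exceedances) < 20:                     # too few tail samples:
        return np.quantile(costs, q)              # fall back to empirical
    xi, _, beta = genpareto.fit(exceedances, floc=0.0)
    p_u = len(exceedances) / len(costs)           # P(cost > u)
    # Invert the GPD tail: the q-quantile sits at conditional level 1-(1-q)/p_u.
    return u + genpareto.ppf(1.0 - (1.0 - q) / p_u, xi, loc=0.0, scale=beta)

def update_dual(lmbda, costs, budget, lr=0.01):
    """Lagrangian-style dual step driven by the tail estimate, not the mean."""
    violation = extreme_quantile(costs) - budget
    return max(0.0, lmbda + lr * violation)

# Heavy-tailed example: the mean cost looks benign while the tail does not.
rng = np.random.default_rng(0)
costs = rng.pareto(2.5, size=5000)                # hypothetical episode costs
print("mean cost:            ", costs.mean())
print("est. 99.9% quantile:  ", extreme_quantile(costs))
print("updated dual variable:", update_dual(1.0, costs, budget=2.0))
```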
Trends
RLVR under scrutiny — mechanistic critiques emerging
Offline RL getting cleaner separation of modeling vs optimization
Safe RL moving beyond average-case to tail-risk constraints
Process rewards gaining traction over outcome-only verification
Notable Papers (5)
1. PROMA: Projected Microbatch Accumulation for Reference-free Proximal Policy Updates
Eliminates reference model in RLHF via orthogonal gradient projection.
2. Aletheia: What Makes RLVR For Code Verifiers Tick?
Controlled testbed for RLVR components under covariate shift.
3. Orthogonalized Policy Optimization: Decoupling Sampling from Optimization in RLHF
Separates sampling geometry from optimization geometry.
4. BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search
Boundary-aware constraints for agent action reliability.
5. PRL: Process Reward Learning Improves LLMs' Reasoning Ability
Process-level rewards outperform outcome-only rewards.
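As a generic illustration of why process-level rewards help (not PRL's specific method), the sketch below contrasts outcome-only verification, which gives every step identical credit, with a per-step process reward model that differentiates the steps; `toy_scorer` is a hypothetical stand-in for a trained PRM.

```python
# Generic contrast between outcome-only and process-level rewards for
# chain-of-thought RL (an illustration, not PRL's method): the outcome signal
# gives every step identical credit, while a per-step process reward model
# differentiates the steps. `toy_scorer` is a hypothetical stand-in PRM.
from typing import Callable, List

def outcome_rewards(steps: List[str], answer_correct: bool) -> List[float]:
    """Sparse terminal reward: zero everywhere except the final step."""
    r = [0.0] * len(steps)
    r[-1] = 1.0 if answer_correct else 0.0
    return r

def process_rewards(steps: List[str],
                    step_scorer: Callable[[str], float]) -> List[float]:
    """One reward per step from a process reward model."""
    return [step_scorer(s) for s in steps]

def returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Per-step returns used as the learning signal for each step."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

steps = ["parse the problem", "set up the equation", "solve for x"]
toy_scorer = lambda s: 0.5 + 0.5 * ("equation" in s)

print(returns(outcome_rewards(steps, answer_correct=True)))  # [1.0, 1.0, 1.0]
print(returns(process_rewards(steps, toy_scorer)))           # [2.0, 1.5, 0.5]
```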
Honorable Mentions
- Factored Value Functions for Graph-Based Multi-Agent RL
- Incentivizing In-depth Reasoning with Process Advantage Shaping