
Reinforcement Learning: January 2026 Week 3

Jan 15 – Jan 21, 2026 · 78 papers analyzed · 3 breakthroughs

Summary

Week 3 (Jan 15-21): 3 breakthroughs from 78 papers. (1) 2601.11061 shows mechanistically that RLVR activates memorization shortcuts rather than reasoning, a fundamental critique of the paradigm; (2) 2601.10471 (DeFlow) decouples behavior-manifold modeling from value maximization in offline RL; (3) 2601.12008 (EVO) applies Extreme Value Theory to tail-risk constraints in safe RL. RLVR-critique papers are emerging; offline RL is getting more principled.

Key Takeaway

RLVR's fundamental assumptions are being challenged; the field is starting to ask 'does this actually work?'

Breakthroughs (3)

1. Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

Why Novel: First mechanistic analysis proving RLVR improves benchmark performance via memorization shortcuts, not genuine reasoning. Uses Path Patching to trace activation patterns.

Key Innovations:

  • Shows that RLVR trained with spurious rewards still improves scores on contaminated benchmarks
  • Path Patching reveals activation of memorization circuits rather than reasoning circuits
  • Challenges the core assumption that RLVR teaches reasoning

Evidence:

  • Spurious reward experimental setup
  • Path Patching mechanistic analysis
  • Activation patterns showing memorization circuits

Impact: Fundamental challenge to RLVR paradigm — performance gains may not reflect capability gains.
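
To make the Path Patching idea concrete, here is a minimal activation-patching sketch in the generic style of mechanistic interpretability work: run the model on a clean and a minimally corrupted prompt, splice one layer's clean activation into the corrupted run, and check whether the original behavior is restored. The model name, prompt pair, and layer index are placeholders, and this shows the generic technique rather than the paper's exact Path Patching protocol.

  # Minimal activation-patching sketch (generic technique, not the paper's exact setup).
  # Model name, prompt pair, and layer index are placeholders; the prompt pair must
  # tokenize to the same length so the spliced activation shapes match.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "gpt2"
  tok = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name).eval()

  clean = tok("2 + 2 =", return_tensors="pt")    # prompt with the behavior of interest
  corrupt = tok("2 + 3 =", return_tensors="pt")  # minimally different prompt

  layer_idx = 5
  cache = {}

  def save_hook(module, inp, out):
      cache["clean_act"] = out[0].detach()       # store the clean-run hidden states

  def patch_hook(module, inp, out):
      # Splice the clean activation into the corrupt forward pass.
      return (cache["clean_act"],) + tuple(out[1:])

  block = model.transformer.h[layer_idx]

  with torch.no_grad():
      handle = block.register_forward_hook(save_hook)
      clean_logits = model(**clean).logits
      handle.remove()

      handle = block.register_forward_hook(patch_hook)
      patched_logits = model(**corrupt).logits   # corrupt run with the clean layer output
      handle.remove()

  # If patching this component restores the clean-run prediction, it carries the behavior;
  # choosing prompt pairs that isolate memorized vs. derived answers is what lets such an
  # analysis distinguish memorization circuits from reasoning circuits.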

2. DeFlow: Decoupling Manifold Modeling and Value Maximization for Offline Policy Extraction

Why Novel: Resolves the expressivity-optimization dilemma in offline RL by separating behavior modeling from value refinement. First to use multi-step flow matching to model the behavior manifold.

Key Innovations:

  • Multi-step flow matching captures behavior manifold faithfully
  • Lightweight instance-aware refinement for value-based improvement
  • Avoids mode collapse from jointly optimizing both objectives

Evidence:

  • Flow matching for behavior modeling
  • Decoupled refinement algorithm
  • D4RL benchmark results

Impact: Provides a clean separation of concerns for offline RL, improving both expressivity and optimization.
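
To illustrate the first half of the decoupling, here is a minimal conditional flow-matching sketch for modeling the behavior policy over actions: a velocity field is trained on linear interpolation paths from noise to dataset actions, and multi-step Euler integration maps noise back onto the behavior manifold at sampling time. Dimensions, network size, and step count are assumptions, and the value-based refinement stage is omitted; this is not DeFlow's actual implementation.

  # Conditional flow-matching sketch for modeling an offline behavior policy over actions.
  # Dimensions, network size, and step count are placeholders; the value-based
  # refinement stage is not shown.
  import torch
  import torch.nn as nn

  state_dim, action_dim = 17, 6   # e.g. a D4RL locomotion task (placeholder dims)

  class VelocityField(nn.Module):
      def __init__(self):
          super().__init__()
          self.net = nn.Sequential(
              nn.Linear(state_dim + action_dim + 1, 256), nn.SiLU(),
              nn.Linear(256, 256), nn.SiLU(),
              nn.Linear(256, action_dim),
          )

      def forward(self, s, a_t, t):
          return self.net(torch.cat([s, a_t, t], dim=-1))

  v_theta = VelocityField()
  opt = torch.optim.Adam(v_theta.parameters(), lr=3e-4)

  def flow_matching_loss(s, a):
      # Linear path from noise a0 to the dataset action a; the target velocity is a - a0.
      a0 = torch.randn_like(a)
      t = torch.rand(a.shape[0], 1)
      a_t = (1 - t) * a0 + t * a
      return ((v_theta(s, a_t, t) - (a - a0)) ** 2).mean()

  @torch.no_grad()
  def sample_behavior_action(s, steps=8):
      # Multi-step Euler integration from noise onto the learned behavior manifold.
      a = torch.randn(s.shape[0], action_dim)
      for k in range(steps):
          t = torch.full((s.shape[0], 1), k / steps)
          a = a + v_theta(s, a, t) / steps
      return a   # candidates that a separate value-aware refinement step would adjust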

3. Extreme Value Policy Optimization for Safe Reinforcement Learning

Why Novel: First to integrate Extreme Value Theory (EVT) into constrained RL for tail-risk safety. A Generalized Pareto Distribution (GPD) constraint captures worst-case outcomes rather than just the average.

Key Innovations:

  • Generalized Pareto Distribution fitted to tail samples for extreme quantile constraint
  • Tail-aware policy optimization avoids catastrophic but rare failures
  • Principled handling of constraint violations in safety-critical domains

Evidence:

  • EVT integration into constrained RL
  • EVO algorithm with GPD constraint
  • Safety benchmark with rare failure scenarios

Impact: Addresses the 'long tail' safety problem that average-case constraints miss.
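
A minimal sketch of the tail-constraint idea using the standard peaks-over-threshold recipe: fit a Generalized Pareto Distribution to episodic-cost excesses over a high threshold, read off an extreme quantile, and compare it against the safety budget. The threshold level, quantile, and the way the estimate enters the policy update are placeholders, not EVO's actual algorithm.

  # Peaks-over-threshold sketch of a GPD tail-risk estimate (not EVO's exact update rule).
  import numpy as np
  from scipy.stats import genpareto

  def extreme_cost_quantile(episode_costs, threshold_q=0.9, alpha=0.99):
      """Estimate the alpha-quantile of episodic cost from a GPD fit to the upper tail."""
      costs = np.asarray(episode_costs, dtype=float)
      u = np.quantile(costs, threshold_q)        # high threshold
      excesses = costs[costs > u] - u            # peaks over threshold
      if len(excesses) < 20:                     # too few tail samples for a stable fit
          return np.quantile(costs, alpha)
      xi, _, sigma = genpareto.fit(excesses, floc=0)   # shape xi, scale sigma
      p_exceed = len(excesses) / len(costs)
      if abs(xi) < 1e-6:                         # exponential-tail limit of the GPD
          return u + sigma * np.log(p_exceed / (1 - alpha))
      # Standard POT quantile: u + (sigma/xi) * (((1 - alpha) / p_exceed)**(-xi) - 1)
      return u + (sigma / xi) * (((1 - alpha) / p_exceed) ** (-xi) - 1)

  # A constrained update would compare this tail estimate, rather than the mean cost,
  # against the safety budget:
  costs = np.random.exponential(scale=1.0, size=2000)   # placeholder rollout costs
  budget = 5.0
  violates_constraint = extreme_cost_quantile(costs) > budget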

Trends

  • RLVR under scrutiny — mechanistic critiques emerging

  • Offline RL getting cleaner separation of modeling vs optimization

  • Safe RL moving beyond average-case to tail-risk constraints

  • Process rewards gaining traction over outcome-only verification

Notable Papers (5)

1. PROMA: Projected Microbatch Accumulation for Reference-free Proximal Policy Updates

Eliminates the reference model in RLHF via orthogonal gradient projection.
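
The summary is a single line, so the sketch below only illustrates the operation named in the title: projecting one gradient vector onto the component orthogonal to another. The pairing of vectors and the accumulation loop are assumptions for illustration, not PROMA's actual update rule.

  # Generic orthogonal gradient projection (illustration of the operation only;
  # not PROMA's actual accumulation or update rule).
  import torch

  def project_orthogonal(g: torch.Tensor, d: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
      """Component of flattened gradient g orthogonal to direction d."""
      return g - (g.dot(d) / d.dot(d).clamp_min(eps)) * d

  # Hypothetical use while accumulating microbatch gradients: keep only the part of
  # each new gradient that is orthogonal to what has already been accumulated.
  accumulated = torch.zeros(10_000)
  for _ in range(4):                          # 4 microbatches (placeholder)
      micro_grad = torch.randn(10_000)        # stand-in for a flattened gradient
      if accumulated.norm() > 0:
          micro_grad = project_orthogonal(micro_grad, accumulated)
      accumulated += micro_grad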

2. Aletheia: What Makes RLVR For Code Verifiers Tick?

Controlled testbed for RLVR components under covariate shift.

3. Orthogonalized Policy Optimization: Decoupling Sampling from Optimization in RLHF

Separates sampling geometry from optimization geometry.

4. BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

Boundary-aware constraints for agent action reliability.

5. PRL: Process Reward Learning Improves LLMs' Reasoning Ability

Process-level rewards outperform outcome-only rewards.
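
A minimal sketch of the distinction driving this line of work: outcome-only verification assigns a single terminal reward, while process-level rewards score every reasoning step, giving denser credit assignment. The step scores and reward values are illustrative; the paper's reward model and training objective are not reproduced here.

  # Outcome-only vs. process-level reward assignment for a multi-step reasoning trace.
  # Step scores and reward values are illustrative, not the paper's reward model.
  from typing import List

  def outcome_only_rewards(num_steps: int, final_correct: bool) -> List[float]:
      # Only the final answer is verified; intermediate steps receive no signal.
      return [0.0] * (num_steps - 1) + [1.0 if final_correct else 0.0]

  def process_rewards(step_scores: List[float], final_correct: bool) -> List[float]:
      # Each reasoning step is scored (e.g. by a process reward model), so credit
      # is assigned densely instead of only at the end.
      return step_scores[:-1] + [step_scores[-1] + (1.0 if final_correct else 0.0)]

  def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
      returns, g = [], 0.0
      for r in reversed(rewards):
          g = r + gamma * g
          returns.append(g)
      return list(reversed(returns))

  print(discounted_returns(outcome_only_rewards(4, final_correct=True)))                # sparse credit
  print(discounted_returns(process_rewards([0.2, 0.5, 0.1, 0.3], final_correct=True)))  # dense credit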

Honorable Mentions

  • Factored Value Functions for Graph-Based Multi-Agent RL
  • Incentivizing In-depth Reasoning with Process Advantage Shaping