
Reinforcement Learning: January 2026 Week 5

Jan 29 – Feb 4, 2026 · 150 papers analyzed · 3 breakthroughs

Summary

Week 5 (Jan 29 - Feb 4): 3 breakthroughs from 150 papers. (1) 2602.02710 (MaxRL) bridges maximum likelihood and RL via compute-aware interpolation with formal gradient equivalence theorems; (2) 2602.01002 formalizes how RLHF amplifies sycophancy through reward model bias propagation with theoretical guarantees; (3) 2602.02545 (MRPO) introduces manifold-reshaping policy optimization that expands reasoning capacity beyond alignment via spectral orthogonal exploration. GRPO variant explosion continues (MC-GRPO, RC-GRPO, QUATRO, CFPO); reward hacking countermeasures mature.

Key Takeaway

Theory consolidation week: formal bridges between ML and RL, first PPO convergence proof, and constructive solutions to RLVR collapse emerge alongside continued GRPO fragmentation.

Breakthroughs (3)

1. Maximum Likelihood Reinforcement Learning

Why Novel: Identifies a principled gap between maximum likelihood and reinforcement learning in sampling-based tasks and introduces MaxRL, a compute-aware framework that interpolates between RL and ML via a Maclaurin expansion of the weighting function, with formal gradient equivalence results.

Key Innovations:

  • Proves ML gradient is a conditional expectation of RL gradient (Theorem 1), establishing formal bridge
  • Introduces Maclaurin-order interpolation between ML and RL objectives
  • Shows compute-optimal interpolation point depends on sample budget
  • Unbiased estimator with variance reduction over pure REINFORCE (Theorem 2)

Evidence:

  • ML gradient shown to be a conditional expectation of the RL gradient (formal equivalence)
  • Unbiased estimator with variance reduction properties
  • Population-level weighting functions showing the ML-RL interpolation
  • Comparison of REINFORCE vs MaxRL estimators across tasks

Impact: Provides theoretical foundation for choosing between ML and RL objectives based on compute budget, potentially changing how practitioners decide training strategies.
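
To make the interpolation concrete, here is a minimal NumPy sketch of per-sample weights on log-probability gradients that slide between an ML-style objective (credit spread over successful samples only) and a REINFORCE-style objective (centered rewards over all samples). The convex mixing parameter lam is an illustrative stand-in for the paper's compute-dependent Maclaurin truncation order; none of the names below come from the paper.

    import numpy as np

    def interpolated_weights(rewards: np.ndarray, lam: float) -> np.ndarray:
        """lam = 0 -> ML-style weights (uniform over successful samples);
        lam = 1 -> REINFORCE-style weights (mean-centered rewards).
        Intermediate values mix the two; a simple convex combination stands in
        for MaxRL's compute-aware interpolation schedule (assumption)."""
        rewards = rewards.astype(float)
        succ = (rewards > 0).astype(float)                # verifiable 0/1 outcomes
        ml_w = succ / max(succ.sum(), 1.0)                # ML-style: successes only
        rl_w = (rewards - rewards.mean()) / len(rewards)  # REINFORCE-style: centered
        return (1.0 - lam) * ml_w + lam * rl_w

    # Usage: weight the log-prob gradients of k = 4 sampled completions.
    print(interpolated_weights(np.array([1.0, 0.0, 0.0, 1.0]), lam=0.3))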

2. How RLHF Amplifies Sycophancy

Why Novel: First formal mechanism explaining sycophancy amplification in RLHF. Traces the full pipeline: biased comparison data → reward model bias → policy amplification, with theoretical guarantees showing amplification is inherent, not incidental.

Key Innovations:

  • Formalizes sycophancy bias in preference data via random utility model (Definition 1)
  • Proves reward model inherits and amplifies data bias (Theorem 1)
  • Shows policy optimization further amplifies sycophancy beyond reward model level (Theorem 2)
  • Two-stage amplification mechanism: data → reward → policy

Evidence:

  • Formal definition of a sycophancy-biased preference distribution
  • Reward model bias inheritance and amplification from the data
  • Policy optimization amplification beyond the reward-model level
  • Empirical validation of the amplification cascade

Impact: Provides theoretical grounding for one of RLHF's most visible failure modes, informing future mitigation strategies.
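
A toy simulation of the cascade, assuming a Bradley-Terry preference model with an additive annotator bias and a KL-regularized (softmax-tilted) policy update; the bias magnitude, temperature, and variable names are illustrative assumptions, not the paper's formal construction.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two candidate answers per prompt: index 0 = sycophantic, 1 = truthful.
    true_utility = np.array([0.0, 1.0])   # the truthful answer is genuinely better
    annot_bias = 1.5                      # annotators over-reward agreement (assumption)

    def sample_preference() -> int:
        """Return 1 if the sycophantic answer wins a biased Bradley-Terry comparison."""
        logit = (true_utility[0] + annot_bias) - true_utility[1]
        return int(rng.random() < 1.0 / (1.0 + np.exp(-logit)))

    prefs = np.array([sample_preference() for _ in range(10_000)])
    p_data = prefs.mean()                 # stage 1: bias is already in the comparison data

    # Stage 2: a Bradley-Terry reward model fit to this data recovers the biased logit.
    reward_gap = np.log(p_data / (1.0 - p_data))   # r_sycophantic - r_truthful

    # Stage 3: KL-regularized policy optimization tilts the reference policy by
    # exp(reward / beta); a small beta amplifies the gap further.
    beta = 0.3
    policy = np.array([0.5, 0.5]) * np.exp(np.array([reward_gap, 0.0]) / beta)
    policy /= policy.sum()

    print(f"P(sycophantic preferred in data) = {p_data:.2f}")    # ~0.62
    print(f"P(policy emits sycophantic)      = {policy[0]:.2f}") # ~0.84, amplified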

3. Beyond Alignment: Expanding Reasoning Capacity via Manifold-Reshaping Policy Optimization

Why Novel: Challenges the assumption that RLVR only aligns existing capabilities. Introduces MRPO, which actively expands reasoning capacity by ejecting policy trajectories into the null space of the bias manifold, preventing the 'sharpening-to-collapse' phenomenon identified in prior weeks.

Key Innovations:

  • Defines the Bias Manifold formally via the covariance structure of hidden states (Definitions 3.1–3.2)
  • Spectral Orthogonal Exploration ejects trajectories into the null space of the bias manifold
  • Proves convergence-driven rank collapse as logit scales increase (Proposition 3.3)
  • Two-stage procedure: explore orthogonal directions, then sustain them via KL-anchored updates

Evidence:

  • Covariance-based definition of representation structure
  • Formal definition of the Bias Manifold
  • Convergence-driven rank collapse mechanism
  • Geometric decoupling of reasoning capacity (visualization)
  • Statistical significance analysis across reasoning benchmarks

Impact: Directly addresses the RLVR collapse diagnosed in W03-W04, offering a constructive solution that expands rather than merely refines capabilities.
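
A minimal NumPy sketch of the geometric idea behind Spectral Orthogonal Exploration: estimate the top principal subspace of hidden-state covariance (standing in for the Bias Manifold) and keep only the component of a proposed exploration direction that falls outside it. The rank cutoff k and the plain eigendecomposition are assumptions for illustration, not the paper's exact construction.

    import numpy as np

    def orthogonal_exploration_direction(hidden_states: np.ndarray,
                                         direction: np.ndarray,
                                         k: int = 8) -> np.ndarray:
        """hidden_states: (n_tokens, d) activations; direction: (d,) proposed update.
        Returns the unit component of `direction` outside the top-k covariance
        subspace, i.e. in the (approximate) null space of the bias manifold."""
        centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
        cov = centered.T @ centered / max(len(centered) - 1, 1)
        _, eigvecs = np.linalg.eigh(cov)                  # eigenvalues ascending
        top = eigvecs[:, -k:]                             # basis of the dominant subspace
        residual = direction - top @ (top.T @ direction)  # remove in-manifold component
        norm = np.linalg.norm(residual)
        return residual / norm if norm > 0 else residual

    # Usage with stand-in activations.
    rng = np.random.default_rng(1)
    h = rng.normal(size=(256, 64))
    v = orthogonal_exploration_direction(h, rng.normal(size=64), k=8)
    print(v.shape, round(float(np.linalg.norm(v)), 3))    # (64,) 1.0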

Trends

  • GRPO variant explosion: MC-GRPO, RC-GRPO, QUATRO, CFPO, GOPO. The base method is being patched in every direction, signaling the need for a new paradigm.

  • Reward model robustness maturing: from identifying hacking (W01) to causal representations and adversarial auditing (W05).

  • Theory catching up: PPO convergence proof, optimal actor-critic complexity, MaxRL gradient theory — formal foundations solidifying.

  • RLVR solutions emerging: MRPO directly addresses collapse; relative-budget theory provides compute-aware framing.

  • Reasoning-tool interference: first quantification of capability competition during agentic RL training.

Notable Papers (7)

1. An Approximate Ascent Approach To Prove Convergence of PPO

First convergence proof for PPO, analyzed as biased policy-gradient ascent with surrogate gradients under random reshuffling.

2. Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum

Establishes the optimal $O(\varepsilon^{-2})$ sample complexity for single-timescale actor-critic via STORM-based variance reduction.
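
The STORM recursion behind the result can be sketched in a few lines: d_t = g(x_t; xi_t) + (1 - a)(d_{t-1} - g(x_{t-1}; xi_t)), which evaluates the same noise sample at consecutive iterates to cancel variance. The toy quadratic objective, step size, and momentum weight a below are illustrative; this is the generic estimator, not the paper's actor-critic instantiation.

    import numpy as np

    rng = np.random.default_rng(0)

    def stoch_grad(x: np.ndarray, xi: np.ndarray) -> np.ndarray:
        """Noisy gradient of the toy objective f(x) = 0.5 * ||x||^2."""
        return x + xi

    x = rng.normal(size=5)
    x_prev = x.copy()
    d = stoch_grad(x, rng.normal(size=5, scale=0.5))   # start from a plain stochastic gradient
    lr, a = 0.1, 0.3

    for _ in range(200):
        xi = rng.normal(size=5, scale=0.5)             # shared noise sample xi_t
        d = stoch_grad(x, xi) + (1.0 - a) * (d - stoch_grad(x_prev, xi))
        x_prev = x.copy()
        x = x - lr * d

    print(np.linalg.norm(x))   # iterate settles near the minimizer at 0, up to noise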

3. Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking

Hacker-Auditor game framework for proactive reward hacking detection and gated mitigation.

4. Clipping-Free Policy Optimization for Large Language Models

Replaces clipping in PPO with convex quadratic penalty from Total Variation constraints, eliminating zero-gradient issues.
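
A hedged sketch of the contrast being drawn: PPO's clipped surrogate goes flat (zero gradient) once the importance ratio leaves the clip interval on the disadvantageous side, while a convex quadratic penalty keeps a restoring gradient at every ratio. The penalty coefficient c and its exact form are illustrative, not necessarily the paper's TV-derived construction.

    import numpy as np

    def ppo_clipped_loss(ratio: np.ndarray, adv: np.ndarray, eps: float = 0.2) -> np.ndarray:
        """Standard PPO surrogate: constant (zero-gradient) once the ratio is clipped."""
        return -np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

    def quadratic_penalty_loss(ratio: np.ndarray, adv: np.ndarray, c: float = 2.0) -> np.ndarray:
        """Unclipped surrogate plus a convex quadratic penalty on the ratio's
        deviation from 1, so the gradient never vanishes entirely."""
        return -(ratio * adv) + c * (ratio - 1.0) ** 2

    ratios = np.linspace(0.5, 2.0, 7)
    adv = np.ones_like(ratios)
    print(ppo_clipped_loss(ratios, adv))        # flat at -1.2 for ratios above 1.2
    print(quadratic_penalty_loss(ratios, adv))  # strictly curved everywhere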

5. Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Introduces Rationale Consistency and MetaJudge metrics revealing deceptive alignment in reward models.

6. Factored Causal Representation Learning for Robust Reward Modeling in RLHF

CausalRM decomposes embeddings into causal/non-causal latents to resist reward hacking.
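
A minimal PyTorch sketch of the factoring idea: split a response embedding into a 'causal' latent that the reward head reads and a 'non-causal' latent meant to absorb spurious cues. The module name, dimensions, and the omission of the disentanglement losses CausalRM would need are all assumptions for illustration.

    import torch
    import torch.nn as nn

    class FactoredRewardHead(nn.Module):
        """Toy factored head over a frozen LM embedding (illustrative, not CausalRM)."""
        def __init__(self, d_model: int, d_latent: int = 64):
            super().__init__()
            self.to_causal = nn.Linear(d_model, d_latent)
            self.to_noncausal = nn.Linear(d_model, d_latent)
            self.reward = nn.Linear(d_latent, 1)   # reward reads the causal latent only

        def forward(self, emb: torch.Tensor):
            z_c = self.to_causal(emb)              # latent intended to carry quality signal
            z_n = self.to_noncausal(emb)           # latent intended to carry spurious cues
            return self.reward(z_c).squeeze(-1), z_c, z_n

    head = FactoredRewardHead(d_model=768)
    rewards, z_c, z_n = head(torch.randn(4, 768))  # stand-in response embeddings
    print(rewards.shape)                           # torch.Size([4])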

7. Reasoning and Tool-use Compete in Agentic RL

Quantifies interference between reasoning and tool-use capabilities during agentic RL training.

Honorable Mentions

  • $V_0$: A Generalist Value Model for Any Policy at State Zero
  • A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards
  • MC-GRPO: Median-Centered Group Relative Policy Optimization
  • Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences
  • Rethinking the Trust Region in LLM Reinforcement Learning
  • RC-GRPO: Reward-Conditioned Group Relative Policy Optimization
  • QUATRO: Query-Adaptive Trust Region Policy Optimization