Reinforcement Learning: February 2026 Week 7

Feb 12 – Feb 18, 2026 · 135 papers analyzed · 3 breakthroughs

Summary

Week 7 (Feb 12-18): 3 breakthroughs from 135 papers. (1) 2602.10286 proves what pairwise preference data actually recovers (ordering structure, not scalar rewards), explaining DPO failure modes; (2) 2602.12386 establishes the first provably convergent actor-critic for risk-averse MARL via Risk-averse Quantal Response Equilibria (RQE); (3) 2602.10609 introduces causal Kalman filtering for LLM policy optimization that tracks off-policy deviation across token sequences rather than clipping each token independently. Multi-agent RL theory advances; preference-alignment foundations are questioned.

Key Takeaway

Foundations week: what does preference data recover? How do we get convergent MARL? The field interrogates its assumptions while building more principled sequential off-policy methods.

Breakthroughs (3)

1. What Does Preference Learning Recover from Pairwise Comparison Data?

Why Novel: Standard practice assumes pairwise data follows the Bradley-Terry model and learns latent scalar scores. This work proves that under violations of that assumption, learned models may not even preserve ordinal structure, explaining why reward models can fail catastrophically despite low preference-prediction loss.
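The failure mode is easiest to see with a preference cycle. Below is a toy sketch (invented data, not from the paper): three items with 90% cyclic preferences A ≻ B ≻ C ≻ A. Bradley-Terry assumes P(i beats j) = σ(s_i − s_j), which forces transitivity, so no scalar scores can represent the cycle.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Empirical win probabilities P(row beats col): a 90% preference cycle.
P = {("A", "B"): 0.9, ("B", "C"): 0.9, ("C", "A"): 0.9}

# Fit Bradley-Terry scores s by gradient descent on the logistic log-loss.
s = {"A": 0.0, "B": 0.0, "C": 0.0}
lr = 0.5
for _ in range(2000):
    grad = {k: 0.0 for k in s}
    for (i, j), p in P.items():
        g = sigmoid(s[i] - s[j]) - p  # gradient of the pairwise log-loss
        grad[i] += g
        grad[j] -= g
    for k in s:
        s[k] -= lr * grad[k]

# The convex loss is minimized at all-equal scores, so every one of the
# three strong (90%) observed preferences is lost by the scalar model.
reversed_pairs = [(i, j) for (i, j) in P if s[i] <= s[j]]
print(s, reversed_pairs)
```

The fitted model sits at its global minimum of preference-prediction loss, yet the recovered scores preserve none of the observed ordinal structure; this is the kind of gap between low loss and ordinal fidelity the paper describes.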

Impact: Fundamental questioning of reward modeling foundations — suggests ordinal methods may be more appropriate than cardinal for alignment.

2. Provably Convergent Actor-Critic in Risk-averse MARL

Why Novel: Computing stationary equilibria in general-sum games is intractable, unlike in single-agent or zero-sum settings. Risk-averse Quantal Response Equilibria (RQE) are computationally tractable while retaining game-theoretic meaning, enabling the first convergence guarantees for multi-agent actor-critic.
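As a minimal illustration of the quantal-response idea (a toy 2x2 game and plain logit responses, not the paper's risk-averse actor-critic): each player softmax-responds to the other's mixed strategy with temperature tau, and with enough smoothing the iteration contracts to a quantal response equilibrium.

```python
import math

def softmax(vals, tau):
    m = max(v / tau for v in vals)
    exps = [math.exp(v / tau - m) for v in vals]
    z = sum(exps)
    return [e / z for e in exps]

# Toy general-sum game (prisoner's-dilemma payoffs, invented for
# illustration): A[i][j] pays player 1, B[i][j] pays player 2.
A = [[3.0, 0.0], [5.0, 1.0]]
B = [[3.0, 5.0], [0.0, 1.0]]
tau = 1.0  # temperature: more smoothing makes the fixed point easier to reach

x, y = [0.5, 0.5], [0.5, 0.5]  # mixed strategies
for _ in range(500):
    u1 = [sum(A[i][j] * y[j] for j in range(2)) for i in range(2)]
    u2 = [sum(B[i][j] * x[i] for i in range(2)) for j in range(2)]
    x, y = softmax(u1, tau), softmax(u2, tau)  # simultaneous logit responses

print(x, y)  # each strategy is now a softmax response to the other's
```

At the fixed point most weight sits on each player's dominant action, smoothed by tau; the paper's contribution is showing that an actor-critic targeting the risk-averse version of this equilibrium provably converges.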

Impact: Opens MARL to principled actor-critic methods with guarantees, beyond empirical approaches that may cycle or diverge.

3. Online Causal Kalman Filtering for Stable and Effective Policy Optimization

Why Novel: PPO-style clipping treats each token independently, ignoring that off-policy deviation compounds across a sequence. Fixed sequence-level ratios lose token-level information. Causal Kalman filtering tracks the evolving off-policy state, enabling locally-aware corrections.
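A minimal sketch of the idea (the noise parameters and the damping rule are invented here; the paper's filter and correction will differ): run a scalar Kalman filter over per-token log importance ratios, treat the filtered estimate as the policy's drift, and accumulate it into an off-policy state that can down-weight updates for tokens deep in a drifted sequence.

```python
import math
import random

random.seed(0)
# Toy per-token log importance ratios log(pi_new / pi_old); a small
# positive mean simulates a policy that has drifted since the rollout.
log_ratios = [random.gauss(0.02, 0.1) for _ in range(64)]

q, r = 1e-3, 5e-2     # process / observation noise variances (assumed)
x_est, P = 0.0, 1.0   # filtered drift estimate and its variance
cum, damp = 0.0, []
for z in log_ratios:
    P += q                        # predict: random-walk drift model
    K = P / (P + r)               # Kalman gain
    x_est += K * (z - x_est)      # update with the observed log-ratio
    P *= 1.0 - K
    cum += x_est                  # cumulative off-policy deviation so far
    damp.append(math.exp(-abs(cum)))  # shrink corrections as drift compounds

print(damp[0], damp[-1])
```

Unlike PPO's per-token clip, the weight at position t reflects the whole prefix: tokens late in a drifted sequence receive progressively smaller corrections.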

Impact: Principled treatment of sequential off-policy structure that PPO clipping ignores; potential replacement for ratio clipping in LLM training.

Trends

  • Preference learning foundations questioned: what does pairwise data actually encode? The distinction between ordinal and cardinal structure becomes critical.

  • MARL theory advances: first convergent actor-critic for general-sum games via RQE; multi-agent LLM training analyzed.

  • Sequential structure in LLM training: Kalman filtering for token sequences; causal credit assignment continues from W06.

  • Reward model robustness: Bayesian non-negative modeling, wild interaction data, addressing systematic biases.

  • Off-policy stability: various approaches to the fundamental tension between sample reuse and policy drift.

Notable Papers (7)

1. Unifying Stable Optimization and Reference Regularization in RLHF

Unified framework combining reward-hacking mitigation (π_0 KL penalty) and stable optimization (π_t clipping) into a single objective.
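A sketch of what such a combined per-token objective can look like (the function name, numbers, and the simple KL estimate are assumptions, not the paper's exact formulation): PPO-style clipping against the sampling policy π_t for stability, plus a KL penalty toward the frozen reference π_0 against reward hacking.

```python
import math

def unified_loss(logp_new, logp_old, logp_ref, advantage,
                 eps=0.2, beta=0.05):
    ratio = math.exp(logp_new - logp_old)         # vs sampling policy pi_t
    clipped = max(min(ratio, 1 + eps), 1 - eps)   # PPO-style clip
    surrogate = min(ratio * advantage, clipped * advantage)
    kl_ref = logp_new - logp_ref                  # crude per-token KL estimate vs pi_0
    return -surrogate + beta * kl_ref

# Toy numbers: the policy has moved toward this token (logp_new > logp_old)
# and away from the reference (logp_new < logp_ref would penalize less).
loss = unified_loss(logp_new=-1.0, logp_old=-1.2, logp_ref=-0.8,
                    advantage=1.5)
print(loss)
```

Here the clip binds (the ratio exceeds 1 + eps), so the surrogate uses the clipped value, while the KL term independently pulls the policy back toward π_0.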

2. Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

BNRM uses sparse non-negative factors to make reward models robust to annotation noise and style biases.

3. General Flexible f-divergence for Challenging Offline RL Datasets

Adaptive f-divergence selection handles datasets with low stochasticity and diverse behavior policies.

4. Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

Agent-specific baselines fix gradient-norm instability when GRPO is extended to multi-agent LLM systems.
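A toy sketch of what an agent-specific baseline looks like in a GRPO-style update (agent names and rewards invented): each agent's rollout rewards are normalized against that agent's own group mean and spread, so one agent's reward scale cannot swamp another's gradients.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards_by_agent):
    # Normalize each agent's group of rollout rewards by its own
    # baseline (mean) and scale (std), GRPO-style but per agent.
    advs = {}
    for agent, rewards in rewards_by_agent.items():
        mu = mean(rewards)
        sd = pstdev(rewards) or 1.0  # guard degenerate all-equal groups
        advs[agent] = [(r - mu) / sd for r in rewards]
    return advs

advs = group_relative_advantages({
    "planner": [1.0, 0.0, 1.0, 0.0],
    "coder":   [0.2, 0.25, 0.15, 0.2],  # much smaller reward spread
})
print(advs)
```

With a single shared baseline, the coder's small-spread rewards would yield near-zero advantages next to the planner's; per-agent normalization keeps the gradient norms of the two agents comparable.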

5. Amortized Molecular Optimization via Group Relative Policy Optimization

GRPO applied to molecular design, enabling amortized optimization across molecule populations.

6. VESPO: Variational Sequence-Level Soft Policy Optimization

Variational formulation for stable off-policy LLM training with asynchronous updates.

7. WildReward: Learning Reward Models from In-the-Wild Human Interactions

Reward model training from deployed LLM interaction logs without explicit annotation.

Honorable Mentions

  • RePO: Bridging On-Policy Learning and Off-Policy Knowledge via Rephrasing
  • Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated Reasoning
  • Asymmetric Prompt Weighting for Reinforcement Learning with Verifiable Rewards
  • Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards
  • Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula
  • Mitigating Mismatch within Reference-based Preference Optimization