Reinforcement Learning: March 2026 Week 12
Mar 16 – Mar 22, 2026 · 104 papers analyzed · 3 breakthroughs
Summary
104 papers analyzed for 2026-03-16 to 2026-03-22: 3 breakthroughs, 7 notable. Key findings:
- 2603.19335 (OXRL) runs 240 controlled experiments across 51 post-training algorithms and reveals scale-dependent ranking inversions: algorithms that lead at 0.5B lose at 7B, and GSM8K performance gaps of 19 pp collapse to under 1 pp on MATH, calling most published algorithm comparisons into question.
- 2603.17577 provides the first identifiability theory for recovering latent actions from action-free offline trajectories with diverse demonstrators, with formal theorems for both finite and continuous observation spaces.
- 2603.15001 (LB-SGB) fixes a known flaw in Stochastic Gradient Bandit convergence theory via log-barrier regularization, proving a worst-case iteration complexity and guaranteeing that the optimal arm never has zero sampling probability.
Dominant trend: RLVR/post-training for LLMs remains the busiest area, with a spread of new tricks (DyJR, Beta-Bernoulli reward estimation, TTRL enhancements).
Key Takeaway
The biggest RL story this week isn't a new algorithm — it's OXRL's evidence that most post-training algorithm comparisons are methodologically broken; meanwhile, identifiability theory for latent actions may unlock web-scale RL pretraining.
Breakthroughs (3)
1. Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions
Why Novel: First large-scale controlled comparison (OXRL framework, ~240 runs) where all algorithms share identical infrastructure. Reveals that benchmark-specific rankings are not transferable, and LoRA vs full fine-tuning differences are negligible at 3B — challenging the premise of most algorithm papers.
Impact: Fundamentally challenges how algorithm papers in the post-training space should be evaluated — single-benchmark, single-scale comparisons likely mislead, and the field may be overstating algorithmic progress.
2. Identifying Latent Actions and Dynamics from Offline Data via Demonstrator Diversity
Why Novel: First formal identifiability theory for the latent action problem. Prior work (BCO, GAIfO, ILPO) learned policies from observation-only data empirically, but lacked guarantees. This work proves when and why recovery is possible, revealing demonstrator diversity as the key structural condition.
Impact: Opens a theoretical foundation for using internet-scale action-free video data (robot videos, gameplay, screen recordings) to pre-train RL policies — the missing piece for web-scale RL pretraining.
3. How Log-Barrier Helps Exploration in Policy Optimization
Why Novel: Prior SGB convergence analyses relied on a hidden exploration assumption, identified by [baudrydoes]: existing proofs hold only if the optimal arm is already sampled with positive probability. LB-SGB removes this assumption entirely via a self-bounding property of the regularized objective.
Impact: Resolves a foundational gap in policy gradient bandit theory; log-barrier regularization has direct implications for safe RL and exploration-guaranteed policy optimization methods.
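The mechanism is concrete enough to sketch for a K-armed softmax bandit: add a log-barrier term λ·Σ_a log π_θ(a) to the objective, whose gradient pushes every arm's probability away from zero, so the optimal arm can never be starved of samples. A minimal sketch under assumed values for the barrier strength and step size (an illustration of the regularizer, not the paper's LB-SGB algorithm or its analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5                                          # number of arms
means = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # Bernoulli arm means (assumed)
theta = np.zeros(K)                            # softmax logits
lam, lr = 0.01, 0.5                            # barrier strength, step size (assumed)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(5000):
    pi = softmax(theta)
    a = rng.choice(K, p=pi)
    r = float(rng.random() < means[a])          # Bernoulli reward for the pulled arm
    # REINFORCE estimate of the reward gradient: r * (e_a - pi)
    grad_reward = r * (np.eye(K)[a] - pi)
    # Gradient of the barrier lam * sum_a log pi(a) w.r.t. logit b: lam * (1 - K * pi_b)
    grad_barrier = lam * (1.0 - K * pi)
    theta += lr * (grad_reward + grad_barrier)

pi = softmax(theta)
print(np.round(pi, 3))                          # mass concentrates on high-reward arms
```

The barrier gradient uses the softmax identity ∂/∂θ_b Σ_a log π(a) = 1 − K·π_b: the regularizer's pull on an arm grows as its probability shrinks, which is the intuition behind the no-zero-probability guarantee.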
Trends
RLVR/post-training for LLMs dominates the week: DyJR, Beta-Bernoulli estimation, context bootstrapping, noisy reward filtering, and TTRL enhancements all target the same bottleneck (sample efficiency + reward quality in GRPO-style training).
Theoretical foundations catching up to practice: both the latent action identifiability paper (2603.17577) and the log-barrier bandit paper (2603.15001) fix gaps that practitioners had been ignoring — suggesting the field is maturing.
Offline-to-online RL transition is an emerging problem: papers on reward shaping for frontier exploration (2603.18326) and offline safe RL (2603.15136) address the practical challenge of deploying offline-trained agents in live environments.
Scale-dependence of algorithm comparisons: OXRL (2603.19335) makes the strongest empirical case yet that most post-training papers compare apples to oranges across scales.
Notable Papers (7)
1. Context Bootstrapped Reinforcement Learning
Addresses exploration inefficiency in RLVR by bootstrapping context from successful rollouts to increase the rate of positive learning signals without additional labeling.
2. Emergent Dexterity via Diverse Resets and Large-Scale Reinforcement Learning
Shows that diverse reset distributions alone — without per-task reward engineering, curricula, or demonstrations — enable emergent dexterous manipulation skills in sim-to-real robot learning.
3. DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay
Improves GRPO sample efficiency via experience replay with Jensen-Shannon divergence gating to avoid mode collapse, improving both diversity and performance on LLM reasoning tasks.
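The gating criterion is standard divergence arithmetic: admit a stored rollout for replay only while the current policy's output distribution remains close to the distribution that generated it. A minimal sketch over discrete distributions (the per-rollout gate and threshold value are assumptions; DyJR's exact criterion may differ):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def replay_gate(current_probs, rollout_probs, threshold=0.1):
    """Admit a stored rollout for replay only if the policy hasn't drifted too far."""
    return js_divergence(current_probs, rollout_probs) < threshold

# identical distributions pass the gate; disjoint ones are rejected
assert replay_gate([0.5, 0.5], [0.5, 0.5])
assert not replay_gate([1.0, 0.0], [0.0, 1.0])
```

JS divergence is symmetric and bounded (by log 2 in nats), which makes it a more stable gate than raw KL when the two distributions have near-disjoint support.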
4. Discounted Beta–Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards
Replaces point-estimate rewards in RLVR with Bayesian Beta-Bernoulli estimation using discounted history, improving sample efficiency and convergence stability in LLM post-training.
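The estimator itself is conjugate-prior bookkeeping: maintain Beta(α, β) pseudo-counts of verifier pass/fail outcomes, decay them each update so stale evidence fades, and use the posterior mean in place of the raw 0/1 reward. A minimal sketch with an illustrative discount and uniform prior (how the estimate plugs into GRPO-style training is not shown here):

```python
class DiscountedBetaBernoulli:
    """Beta posterior over a pass rate, with exponential discounting of old evidence."""

    def __init__(self, alpha=1.0, beta=1.0, discount=0.95):
        self.alpha, self.beta, self.discount = alpha, beta, discount

    def update(self, passed: bool):
        # decay old pseudo-counts, then add the new verifier outcome
        self.alpha = self.discount * self.alpha + float(passed)
        self.beta = self.discount * self.beta + float(not passed)

    def mean(self) -> float:
        # posterior mean estimate of the pass probability
        return self.alpha / (self.alpha + self.beta)

est = DiscountedBetaBernoulli()
for outcome in [1, 1, 1, 0, 1]:
    est.update(bool(outcome))
print(round(est.mean(), 3))  # → 0.716
```

The discount makes the estimate track a non-stationary pass rate (the policy improves during training), while the Beta prior keeps early rewards smoothed rather than jumping between 0 and 1.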
5. What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time
Augments Test-Time RL with negative pseudo-labels from consensus failures, addressing the vulnerability of TTRL to challenging scenarios where majority voting produces wrong answers.
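The failure mode is easy to state: TTRL rewards agreement with the majority-voted answer, so a confidently wrong consensus trains the model toward its own error. A minimal sketch of the vote-then-reward step, using one hedged reading of the paper's fix, where a low-agreement consensus answer becomes a negative pseudo-label (the selection rule and threshold here are assumptions):

```python
from collections import Counter

def ttrl_rewards(answers, min_agreement=0.6):
    """Majority-vote pseudo-labeling over a batch of sampled answers.

    High agreement: reward answers matching the majority (+1), as in standard TTRL.
    Low agreement: treat the shaky majority answer as a negative pseudo-label and
    penalize matching it (-1). Illustrative rule only; the paper's
    selective-complementary criterion may differ.
    """
    majority, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    if agreement >= min_agreement:
        return [1.0 if a == majority else 0.0 for a in answers]
    return [-1.0 if a == majority else 0.0 for a in answers]

# strong consensus (agreement 0.75): standard positive reward
print(ttrl_rewards(["12", "12", "12", "7"]))
# weak consensus (agreement 0.4): majority answer becomes a negative pseudo-label
print(ttrl_rewards(["12", "7", "9", "12", "3"]))
```

The negative branch supplies learning signal exactly in the regime where vanilla majority voting is least trustworthy, instead of reinforcing a likely-wrong answer.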
6. Escaping Offline Pessimism: Vector-Field Reward Shaping for Safe Frontier Exploration
Uses orthogonal manifold-seeking reward shaping (proven via Theorem 1) to guide offline-trained policies toward out-of-distribution frontier states during online fine-tuning, escaping pessimistic policy support.
7. Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies
Reformulates offline safe RL using reachability flow policies, avoiding soft expected-cost approximations and enabling hard safety constraint satisfaction with theoretical guarantees.
Honorable Mentions
- Complementary Reinforcement Learning
- Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning
- Unified Policy Value Decomposition for Rapid Adaptation
- Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning
- Benchmarking Reinforcement Learning via Stochastic Converse Optimality