Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?
Zihan Chen, Yiming Zhang, Hengguang Zhou, Zenghui Ding, Yining Sun, Cho-Jui Hsieh
TL;DR
The paper asks whether current RL benchmarks reliably measure generalization in LLM alignment and shows that high scores can mask brittle, shortcut-driven behavior. It introduces the Oracle Performance Gap (OPG) and a suite of stress tests, covering difficulty, distribution, and counterfactual reasoning, to diagnose benchmark flaws. Results reveal a vanishing generalization gap for RL methods, in contrast to SFT, and show that averaging over tasks hides meaningful differences in cross-difficulty generalization and OOD robustness. The authors propose three design principles (sufficient difficulty, balanced evaluation, and distributional robustness) to guide the creation of more faithful benchmarks and promote genuinely transferable RL-powered reasoning. Adopting these principles is essential to ensure that reported progress reflects robust generalization rather than artifact-driven scores.
Abstract
Current benchmarks are inadequate for evaluating progress in reinforcement learning (RL) for large language models (LLMs). Despite recent benchmark gains reported for RL, we find that training on these benchmarks' training sets achieves nearly the same performance as training directly on the test sets, suggesting that the benchmarks cannot reliably separate further progress. To study this phenomenon, we introduce a diagnostic suite and the Oracle Performance Gap (OPG) metric that quantifies the performance difference between training on the train split versus the test split of a benchmark. We further analyze this phenomenon with stress tests and find that, despite strong benchmark scores, existing RL methods struggle to generalize across distribution shifts, varying levels of difficulty, and counterfactual scenarios: shortcomings that current benchmarks fail to reveal. We conclude that current benchmarks are insufficient for evaluating generalization and propose three core principles for designing more faithful benchmarks: sufficient difficulty, balanced evaluation, and distributional robustness.
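To make the OPG definition concrete, the following is a minimal sketch, not the authors' implementation. It assumes that both models are scored by accuracy on the same test split and that the gap is taken as the test-trained ("oracle") score minus the train-trained score; the abstract does not fix the sign convention, and the function name, parameter names, and numbers are hypothetical.

```python
def oracle_performance_gap(acc_train_split: float, acc_test_split: float) -> float:
    """Return the gap between two test-set accuracies.

    acc_train_split: accuracy of the model trained on the benchmark's train split.
    acc_test_split:  accuracy of the "oracle" model trained directly on the test split.
    Sign convention (oracle minus train-trained) is an assumption of this sketch.
    """
    for name, value in (("acc_train_split", acc_train_split),
                        ("acc_test_split", acc_test_split)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return acc_test_split - acc_train_split


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
gap = oracle_performance_gap(acc_train_split=0.5, acc_test_split=0.75)
print(f"OPG = {gap:.2f}")
```

Under this convention, a gap near zero means that training on the test set itself buys almost nothing over training on the train split, which is the situation the abstract reports for RL methods.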
