Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?

Zihan Chen, Yiming Zhang, Hengguang Zhou, Zenghui Ding, Yining Sun, Cho-Jui Hsieh

TL;DR

The paper interrogates whether current RL benchmarks reliably measure generalization in LLM alignment and shows that high scores can mask brittle, shortcut-driven behavior. It introduces the Oracle Performance Gap (OPG) and a suite of stress tests—across difficulty, distribution, and counterfactual reasoning—to diagnose benchmark flaws. Results reveal a vanishing generalization gap for RL methods, contrasting with SFT, and demonstrate that averaging over tasks hides meaningful differences in cross-difficulty generalization and OOD robustness. The authors propose three design principles—sufficient difficulty, balanced evaluation, and distributional robustness—to guide the creation of more faithful benchmarks and promote genuinely transferable RL-powered reasoning. Adopting these principles is essential to ensure progress reflects robust generalization rather than artifact-driven scores.

Abstract

Current benchmarks are inadequate for evaluating progress in reinforcement learning (RL) for large language models (LLMs). Despite recent benchmark gains reported for RL, we find that training on these benchmarks' training sets achieves nearly the same performance as training directly on the test sets, suggesting that the benchmarks cannot reliably separate further progress. To study this phenomenon, we introduce a diagnostic suite and the Oracle Performance Gap (OPG) metric that quantifies the performance difference between training on the train split versus the test split of a benchmark. We further analyze this phenomenon with stress tests and find that, despite strong benchmark scores, existing RL methods struggle to generalize across distribution shifts, varying levels of difficulty, and counterfactual scenarios: shortcomings that current benchmarks fail to reveal. We conclude that current benchmarks are insufficient for evaluating generalization and propose three core principles for designing more faithful benchmarks: sufficient difficulty, balanced evaluation, and distributional robustness.
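
As a rough illustration of the OPG idea in the abstract, the sketch below computes the difference between test accuracy after training on the test split (the "oracle" run) and test accuracy after training on the train split. The function name, the sign convention (oracle minus standard), the choice to evaluate both runs on the test split, and the example accuracies are all assumptions made for illustration; the paper's exact definition may differ.

```python
def oracle_performance_gap(oracle_acc: float, standard_acc: float) -> float:
    """Hypothetical OPG sketch.

    oracle_acc:   test accuracy of a model trained directly on the test split.
    standard_acc: test accuracy of a model trained on the train split.
    Sign convention (oracle minus standard) is an assumption.
    """
    for name, value in (("oracle_acc", oracle_acc), ("standard_acc", standard_acc)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction in [0, 1], got {value}")
    return oracle_acc - standard_acc


# Made-up accuracies, not results from the paper.
gap = oracle_performance_gap(oracle_acc=0.80, standard_acc=0.78)
print(f"OPG = {gap:.3f}")
```

Under this reading, a gap near zero means that training on the test set itself buys almost nothing over ordinary training, which is the "vanishing generalization gap" the abstract describes.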

Paper Structure

This paper contains 43 sections, 4 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Overview of our empirical framework. The workflow begins by diagnosing benchmark flaws with novel metrics to uncover a core symptom: a vanishing generalization gap. It then proceeds through a suite of stress tests that reveal the brittle, shortcut-based nature of the learned skills, culminating in a new set of principles for more robust evaluation.
  • Figure 2: The Illusion of Average Performance. (a) The mean performance gap between the best (specialist) model and the average of all other models widens dramatically as task difficulty increases. (b) Surprisingly, the average scores of these specialists (calculated across all five difficulty partitions) are nearly identical. This contrast illustrates how a difficulty-agnostic evaluation can mask substantial differences in generalization capability. Full performance data is provided in Appendix \ref{app:c_result}.
  • Figure 3: The Average Cross-Difficulty Generalization score for 3B and 7B models. The y-axis represents the average accuracy of a specialist model (trained on level $L_i$) on all other, unseen difficulty levels. Both models show a clear trend: as the training data complexity increases, the model's ability to generalize to other difficulties improves, with the Level 5-trained model being the strongest generalist.
  • Figure 4: Performance collapse.
  • Figure 4: Asymmetric Generalization is consistent across model scales. Across both the 3B model (a) and the 7B model (b), training on high-difficulty problems (L4-L5, orange line) yields a uniformly superior performance lift over training on easier problems (L1-L3, blue line), proving that mastering complexity is essential for acquiring robust, transferable skills. Full performance data is provided in Table \ref{tab:complexity-test-3b-appendix} and Table \ref{tab:complexity-test-7b-appendix}.
  • ...and 1 more figure
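
The Average Cross-Difficulty Generalization score described in the Figure 3 caption (the average accuracy of a specialist trained on level $L_i$ over all other, unseen difficulty levels) can be sketched as follows. The accuracy matrix layout, the function name, and the numbers are hypothetical and chosen only to show the computation; they are not data from the paper.

```python
from typing import List


def cross_difficulty_score(acc: List[List[float]], train_level: int) -> float:
    """Mean accuracy of the specialist trained on `train_level` over every
    other difficulty level.

    acc[i][j] is assumed to hold the accuracy of the model trained on
    level i when evaluated on level j (0-indexed levels).
    """
    unseen = [a for j, a in enumerate(acc[train_level]) if j != train_level]
    if not unseen:
        raise ValueError("need at least two difficulty levels")
    return sum(unseen) / len(unseen)


# Made-up 3-level accuracy matrix (rows: training level, columns: eval level).
acc = [
    [0.90, 0.50, 0.30],
    [0.80, 0.70, 0.50],
    [0.80, 0.60, 0.60],
]

for level in range(len(acc)):
    print(f"trained on L{level + 1}: {cross_difficulty_score(acc, level):.2f}")
```

Excluding the diagonal entry keeps the score focused on unseen levels, so a specialist is not rewarded for fitting its own training difficulty.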