Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning

Md Tanvirul Alam, Nidhi Rastogi

TL;DR

This paper probes the limits of RLVR in fostering genuine mathematical reasoning by studying two verifiable combinatorial problems with unique optima. It systematically compares multiple reward designs and evaluation metrics to distinguish true reasoning from shortcut exploitation. The findings reveal that RLVR often improves surface metrics but does not consistently induce deeper reasoning; activity scheduling shows clearer gains in reasoning fidelity than LIS, where benefits arise from heuristics or formatting. These results underline the need for benchmarks and diagnostics that disentangle reasoning from pattern-matching, informing better reward designs and evaluation protocols for mathematical reasoning.

Abstract

Mathematical reasoning is a central challenge for large language models (LLMs), requiring not only correct answers but also faithful reasoning processes. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities; however, its ability to foster genuine reasoning remains unclear. We investigate RLVR on two combinatorial problems with fully verifiable solutions: Activity Scheduling and the Longest Increasing Subsequence, using carefully curated datasets with unique optima. Across multiple reward designs, we find that RLVR improves evaluation metrics but often by reinforcing superficial heuristics rather than acquiring new reasoning strategies. These findings highlight the limits of RLVR generalization, emphasizing the importance of benchmarks that disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress. Code available at https://github.com/xashru/rlvr-seq-generalization.
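To make "fully verifiable solutions" and "unique optima" concrete, the following is a minimal sketch of how one might check that an instance of either task has exactly one optimal solution. It is not taken from the authors' repository; the function names (`longest_chain`, `lis_stats`, `activity_stats`), the quadratic dynamic program, and the compatibility convention (an activity may start exactly when the previous one ends, and every activity has start < end) are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple


def longest_chain(items: Sequence, precedes: Callable) -> Tuple[int, int]:
    """Return (max chain length, number of max-length chains).

    Assumes `items` is ordered so that a valid predecessor of items[i]
    always appears at an index j < i. Runs in O(n^2) time.
    """
    n = len(items)
    if n == 0:
        return 0, 1  # only the empty chain exists
    length = [1] * n  # best chain length ending at i
    count = [1] * n   # number of best chains ending at i
    for i in range(n):
        for j in range(i):
            if precedes(items[j], items[i]):
                if length[j] + 1 > length[i]:
                    length[i] = length[j] + 1
                    count[i] = count[j]
                elif length[j] + 1 == length[i]:
                    count[i] += count[j]
    best = max(length)
    total = sum(c for l, c in zip(length, count) if l == best)
    return best, total


def lis_stats(seq: List[int]) -> Tuple[int, int]:
    """Longest strictly increasing subsequence: (length, number of optima)."""
    return longest_chain(seq, lambda a, b: a < b)


def activity_stats(acts: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Max set of non-overlapping (start, end) activities: (size, number of optima).

    Assumption: start < end for every activity, and end_j <= start_i
    counts as compatible.
    """
    ordered = sorted(acts, key=lambda a: (a[1], a[0]))  # sort by end time
    return longest_chain(ordered, lambda a, b: a[1] <= b[0])


def has_unique_optimum(stats: Tuple[int, int]) -> bool:
    """True when exactly one optimal solution exists."""
    return stats[1] == 1
```

Under these assumptions, an instance such as the sequence `[5, 1, 2, 3]` would pass the uniqueness filter, while `[3, 1, 2, 5, 4]` would be rejected because both `1, 2, 5` and `1, 2, 4` are optimal. With a unique optimum, an exact-match check against the ground truth is enough to verify a model's answer.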

Paper Structure

This paper contains 48 sections, 14 equations, 12 figures, 1 table, 4 algorithms.

Figures (12)

  • Figure 1: Example question and ground-truth for Activity Scheduling (left) and LIS (right).
  • Figure 2: Performance comparison of Base, RL($r_{\text{ans}}$), and RL($r_{\text{ans+fmt}}$) models on the Activity Scheduling and LIS tasks with the Qwen2.5-7B model.
  • Figure 3: Performance comparison of RL models trained with $r_{\text{ids,exa}}$ and $r_{\text{ids,pre}}$.
  • Figure 4: Sorting accuracy and LCS across models.
  • Figure 5: Curriculum experiments with $r_{\text{sort}}$. Each curve shows accuracy on the training set when models are trained with $r_{\text{sort}}$ for the first 10, 20, or 30 PPO updates, followed by $r_{\text{ans}}$ for the remainder. Longer pretraining with $r_{\text{sort}}$ makes it increasingly difficult for the model to recover under $r_{\text{ans}}$.
  • ...and 7 more figures