VAR-MATH: Probing True Mathematical Reasoning in LLMs via Symbolic Multi-Instance Benchmarks

Jian Yao, Ran Cheng, Kay Chen Tan

TL;DR

VAR-MATH introduces a symbolic multi-instantiation evaluation to probe true mathematical reasoning in LLMs, addressing the contamination and fragility of single-instance benchmarks. By transforming fixed problems from AMC23, AIME24, and AIME25 into parameterized templates and testing across multiple variants, the framework enforces reasoning consistency and uses loose and strict metrics with bootstrapping to stabilize estimates. Empirical results show that RL-finetuned models suffer substantial drops on VAR-MATH, revealing reliance on memorization and superficial cues rather than robust generalization. The work highlights the need for contamination-resistant evaluation in reasoning tasks and offers a generalizable paradigm for rigorous assessment across domains.

Abstract

Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of LLMs, as measured by standard benchmarks. Yet these gains often persist even when models are trained with flawed signals, such as random or inverted rewards. This raises a fundamental question: do such improvements reflect genuine reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To answer this question, we adopt an evaluation-centric perspective and highlight two critical shortcomings in existing protocols. First, benchmark contamination arises because test problems are publicly available, thereby increasing the risk of data leakage. Second, evaluation fragility results from reliance on single-instance assessments, which are sensitive to stochastic outputs and fail to capture reasoning consistency. These limitations suggest the need for a new evaluation paradigm that can probe reasoning ability beyond memorization and one-off success. In response, we propose VAR-MATH, a symbolic evaluation framework that converts fixed numerical problems into parameterized templates and requires models to solve multiple instantiations of each. This design enforces consistency across structurally equivalent variants, mitigates contamination, and enhances robustness through bootstrapped metrics. We apply VAR-MATH to transform three popular benchmarks, AMC23, AIME24, and AIME25, into their symbolic counterparts, VAR-AMC23, VAR-AIME24, and VAR-AIME25. Experimental results show substantial performance drops for RL-trained models on these variabilized benchmarks, especially for smaller models, with average declines of 47.9% on AMC23, 58.8% on AIME24, and 72.9% on AIME25. These findings indicate that some existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms.
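
The core idea of the abstract, a parameterized template scored over several instantiations, can be illustrated with a minimal Python sketch. Everything in it is a hypothetical illustration rather than the authors' implementation: the rectangle-area template, the function names, and the two toy solvers are invented, and the metric definitions are assumptions (loose is taken as the fraction of variants solved, strict as all-or-nothing credit across variants).

```python
import random
import re

def instantiate(rng):
    """Sample one concrete variant of a hypothetical template.

    The fixed problem "area of a 3 by 4 rectangle" is abstracted so that
    both constants become variables drawn from a feasible range.
    """
    w, h = rng.randint(2, 9), rng.randint(2, 9)
    return f"What is the area of a {w} by {h} rectangle?", w * h

def score_problem(solver, variants):
    """Return (loose, strict) scores for one templated problem.

    Assumed definitions: loose = fraction of variants answered correctly,
    strict = 1.0 only if every variant is answered correctly.
    """
    correct = [solver(question) == answer for question, answer in variants]
    loose = sum(correct) / len(correct)
    strict = 1.0 if all(correct) else 0.0
    return loose, strict

def memorizing_solver(question):
    # Toy stand-in for a contaminated model: it recalls the answer to the
    # original 3-by-4 instance regardless of the numbers in the question.
    return 12

def reasoning_solver(question):
    # Toy stand-in for a model that actually uses the given constants.
    w, h = map(int, re.findall(r"\d+", question))
    return w * h

rng = random.Random(0)                      # seeded for reproducibility
variants = [instantiate(rng) for _ in range(5)]
print("reasoning :", score_problem(reasoning_solver, variants))
print("memorizing:", score_problem(memorizing_solver, variants))
```

Under these assumed definitions, a solver that only recalls the original answer can still score on a single-instance benchmark, but it loses strict credit as soon as any sampled variant has a different answer.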

Paper Structure

This paper contains 25 sections, 2 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Multi-Instance Verification (VAR-MATH) vs. Single-Instance Assessment
  • Figure 2: Overview of the VAR-MATH pipeline. The process consists of two stages: preprocessing, where original problems are symbolically abstracted by replacing constants with variables and defining feasible sampling ranges, and evaluation, where problems are instantiated into multiple concrete variants and assessed using loose (LM) and strict (SM) consistency metrics.
  • Figure 3: Illustration of the bootstrap procedure with $K=5$ variants.
  • Figure 4: Standard deviation of model scores. VAR-MATH significantly reduces output variance across AMC23, AIME24, and AIME25.
  • Figure 5: Illustrative examples of symbolic abstraction and metadata in VAR-MATH.
  • ...and 3 more figures