VAR-MATH: Probing True Mathematical Reasoning in LLMs via Symbolic Multi-Instance Benchmarks

Jian Yao; Ran Cheng; Kay Chen Tan

TL;DR

VAR-MATH introduces a symbolic multi-instantiation evaluation to probe true mathematical reasoning in LLMs, addressing the contamination and fragility of single-instance benchmarks. The framework transforms fixed problems from AMC23, AIME24, and AIME25 into parameterized templates and tests each across multiple variants, which enforces reasoning consistency. It reports loose and strict metrics and uses bootstrapping to stabilize the estimates. Empirical results show that RL-finetuned models suffer substantial drops on VAR-MATH, revealing reliance on memorization and superficial cues rather than robust generalization. The work highlights the need for contamination-resistant evaluation in reasoning tasks and offers a generalizable paradigm for rigorous assessment across domains.

Abstract

Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of LLMs, as measured by standard benchmarks. Yet these gains often persist even when models are trained with flawed signals, such as random or inverted rewards. This raises a fundamental question: do such improvements reflect genuine reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To answer this question, we adopt an evaluation-centric perspective and highlight two critical shortcomings in existing protocols. First, benchmark contamination arises because test problems are publicly available, thereby increasing the risk of data leakage. Second, evaluation fragility results from reliance on single-instance assessments, which are sensitive to stochastic outputs and fail to capture reasoning consistency. These limitations suggest the need for a new evaluation paradigm that can probe reasoning ability beyond memorization and one-off success. In response, we propose VAR-MATH, a symbolic evaluation framework that converts fixed numerical problems into parameterized templates and requires models to solve multiple instantiations of each. This design enforces consistency across structurally equivalent variants, mitigates contamination, and enhances robustness through bootstrapped metrics. We apply VAR-MATH to transform three popular benchmarks, AMC23, AIME24, and AIME25, into their symbolic counterparts, VAR-AMC23, VAR-AIME24, and VAR-AIME25. Experimental results show substantial performance drops for RL-trained models on these variabilized benchmarks, especially for smaller models, with average declines of 47.9% on AMC23, 58.8% on AIME24, and 72.9% on AIME25. These findings indicate that some existing RL methods rely on superficial heuristics and fail to generalize beyond specific numerical forms.
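
The multi-instance idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy template, the function names (`instantiate`, `score_template`, `bootstrap_mean`), and the exact metric definitions are assumptions made for illustration. Here the loose score is taken to be the fraction of variants solved, the strict score requires every variant of a template to be solved, and the bootstrap resamples per-template scores with replacement.

```python
import random
from statistics import mean


def instantiate(a: int, b: int) -> tuple[str, int]:
    """Fill a hypothetical template with concrete values; return (question, answer).

    The quadratic (x - a)(x - b) = x^2 - (a + b)x + ab has roots a and b,
    so the sum of its roots is a + b for every choice of parameters.
    """
    question = f"What is the sum of the roots of x^2 - {a + b}x + {a * b} = 0?"
    return question, a + b


def score_template(results: list[bool]) -> tuple[float, float]:
    """Score one template from per-variant correctness flags.

    Assumed definitions (not taken from the paper):
      loose  = fraction of variants answered correctly
      strict = 1.0 only if every variant is answered correctly, else 0.0
    """
    loose = sum(results) / len(results)
    strict = 1.0 if all(results) else 0.0
    return loose, strict


def bootstrap_mean(values: list[float], n_resamples: int = 1000, seed: int = 0) -> float:
    """Average the benchmark score over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(values)
    resampled = [mean(rng.choices(values, k=n)) for _ in range(n_resamples)]
    return mean(resampled)


if __name__ == "__main__":
    # Three structurally equivalent variants of the same template.
    for a, b in [(2, 3), (4, 7), (5, 11)]:
        print(instantiate(a, b))

    # A model that solves two of the three variants fails the strict criterion.
    print(score_template([True, True, False]))
```

Under these assumed definitions, a model that has memorized the original numerical instance but fails on re-parameterized variants keeps a nonzero loose score while its strict score drops to zero, which is the kind of gap the variabilized benchmarks are designed to expose.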
