VeRA: Verified Reasoning Data Augmentation at Scale
Zerui Cheng, Jiashuo Liu, Chunjie Wu, Jianzhu Yao, Pramod Viswanath, Ge Zhang, Wenhao Huang
TL;DR
VeRA tackles evaluation stagnation by converting benchmark problems into executable specifications, each composed of a template, a deterministic generator, and a verifier that yields labels. It introduces VeRA-E to create logic-preserving equivalent rewrites and VeRA-H/VeRA-H Pro to generate verifiably harder tasks, enabling scalable, automated, and label-consistent evaluation at frontier scale. Experiments on GSM8K, AIME, Beyond-AIME, and AMO-Bench with 16 frontier models show that VeRA-E can reveal contamination and restore discriminability, while VeRA-H Pro yields sharper per-seed deltas and renewed headroom on saturated benchmarks. By reframing benchmarks as renewable infrastructure with verifiable labels, VeRA offers a scalable, auditable path for evaluating progress and even generating training data, with open-source code and datasets to enable broad adoption.
Abstract
The main issue with most evaluation schemes today is their "static" nature: the same problems are reused repeatedly, allowing for memorization, format exploitation, and eventual saturation. To measure genuine AI progress, we need evaluation that is robust by construction, not by post-hoc detection. In response, we propose VeRA (Verified Reasoning Data Augmentation), a framework that converts benchmark problems into executable specifications comprising (i) a natural language template with placeholder slots, (ii) a coherent generator that samples valid configurations, and (iii) a deterministic verifier that validates parameters and computes the correct answer for each configuration. From a single seed problem, VeRA automatically creates unlimited verified variants with reliable labels at near-zero marginal cost and without human involvement. VeRA operates in two complementary modes. VeRA-E (equivalent) rewrites problems while keeping the underlying logic intact, which is useful for distinguishing memorization from genuine reasoning. VeRA-H (hardened) systematically increases complexity while remaining verifiable, enabling reliable creation and labelling of fresh, difficult tasks at the boundary of intelligence. Evaluating 16 frontier models with VeRA, we find that (i) VeRA-E improves evaluation quality and reveals contamination patterns; (ii) VeRA-H enables human-free generation of hard tasks with reliable labels; and (iii) VeRA establishes verified benchmarks as a general paradigm. VeRA reconceptualizes benchmarks from static objects that are used until exhausted to executable specifications that generate fresh, verified instances on demand, enhancing the robustness and cost-effectiveness of evaluation. With VeRA, we envision that evaluation in any verifiable domain can scale indefinitely without sacrificing label integrity. To stimulate future research, we have open-sourced all code and datasets.
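To make the three-part specification concrete, here is a minimal Python sketch of a template, a generator, and a verifier working together. It is an illustration only and not VeRA's released code: the names `Spec`, `sample_params`, `verify`, and `TOY_SPEC` are hypothetical, the toy word problem is invented rather than taken from any benchmark, and rejection sampling is an assumed way of combining the generator with the verifier.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Params = Dict[str, int]


@dataclass(frozen=True)
class Spec:
    """Hypothetical executable specification: template + generator + verifier."""
    template: str                                   # text with {placeholder} slots
    generator: Callable[[random.Random], Params]    # samples a candidate configuration
    verifier: Callable[[Params], Optional[int]]     # returns the label, or None if invalid

    def instantiate(self, seed: int, max_tries: int = 100) -> Tuple[str, int]:
        """Return one (problem text, answer) pair, reproducible from the seed."""
        rng = random.Random(seed)
        for _ in range(max_tries):
            params = self.generator(rng)
            answer = self.verifier(params)
            if answer is not None:                  # keep only verified configurations
                return self.template.format(**params), answer
        raise RuntimeError("no valid configuration found")


def sample_params(rng: random.Random) -> Params:
    # Assumed parameter ranges, chosen only for illustration.
    return {
        "start": rng.randint(10, 99),
        "given": rng.randint(1, 120),
        "price": rng.randint(2, 9),
    }


def verify(p: Params) -> Optional[int]:
    # Reject configurations where nothing is left to sell.
    left = p["start"] - p["given"]
    if left <= 0:
        return None
    # Deterministically compute the label from the parameters.
    return left * p["price"]


TOY_SPEC = Spec(
    template=(
        "Ana has {start} apples. She gives away {given} and sells the rest "
        "for {price} dollars each. How many dollars does she earn?"
    ),
    generator=sample_params,
    verifier=verify,
)

if __name__ == "__main__":
    for seed in range(3):
        text, answer = TOY_SPEC.instantiate(seed)
        print(text, "->", answer)
```

In this sketch every label comes from the verifier rather than from a human or a model, and the same seed always reproduces the same instance. That mirrors, in miniature, the abstract's point that variants can be generated on demand without sacrificing label integrity.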
