VeRA: Verified Reasoning Data Augmentation at Scale

Zerui Cheng, Jiashuo Liu, Chunjie Wu, Jianzhu Yao, Pramod Viswanath, Ge Zhang, Wenhao Huang

TL;DR

VeRA addresses evaluation stagnation by converting benchmark problems into executable specifications, each consisting of a natural-language template, a generator that samples valid configurations, and a deterministic verifier that computes the label. It introduces VeRA-E, which produces logic-preserving equivalent rewrites, and VeRA-H and VeRA-H Pro, which generate verifiably harder tasks, enabling scalable, automated, and label-consistent evaluation at frontier scale. Experiments on GSM8K, AIME, Beyond-AIME, and AMO-Bench with 16 frontier models show that VeRA-E can reveal contamination and restore discriminability, while VeRA-H Pro yields sharper per-seed deltas and renewed headroom on saturated benchmarks. By reframing benchmarks as renewable infrastructure with verifiable labels, VeRA offers a scalable, auditable path for evaluating progress and for generating training data. The code and datasets are open-sourced.

Abstract

The main issue with most evaluation schemes today is their "static" nature: the same problems are reused repeatedly, allowing for memorization, format exploitation, and eventual saturation. To measure genuine AI progress, we need evaluation that is robust by construction, not by post-hoc detection. In response, we propose VeRA (Verified Reasoning Data Augmentation), a framework that converts benchmark problems into executable specifications, comprising (i) a natural language template with placeholder slots, (ii) a coherent generator that samples valid configurations, and (iii) a deterministic verifier that validates parameters and calculates the corresponding correct answers for each configuration. From a single seed problem, VeRA automatically creates unlimited verified variants with reliable labels at near-zero marginal cost without human involvement. VeRA operates in two complementary modes. VeRA-E (equivalent) rewrites problems while keeping the underlying logic intact, useful for detecting memorization versus genuine reasoning. VeRA-H (hardened) systematically increases complexity while remaining verifiable, enabling reliable creation and labelling of fresh difficult tasks at the boundary of intelligence. Evaluating 16 frontier models with VeRA, we find: (i) VeRA-E improves evaluation quality and reveals contamination patterns. (ii) VeRA-H enables human-free generation of hard tasks with reliable labels. (iii) VeRA establishes verified benchmarks as a general paradigm. VeRA reconceptualizes benchmarks from static objects used until exhausted, to executable specifications generating fresh, verified instances on demand, enhancing robustness and cost-effectiveness for evaluation. With VeRA, we envision that evaluation in any verifiable domain can scale indefinitely without sacrificing label integrity. To stimulate future research, we have open-sourced all code and datasets.
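The three components named in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the `Spec` class, the rectangle-area task, and every function name below are hypothetical, chosen only to show how a template, a generator, and a deterministic verifier could fit together so that labels come from executing code rather than from a model.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Params = Dict[str, int]


@dataclass(frozen=True)
class Spec:
    """Hypothetical container for one executable specification."""
    template: str                                   # text with placeholder slots
    generator: Callable[[random.Random], Params]    # proposes slot assignments
    verifier: Callable[[Params], Optional[int]]     # None if invalid, else the label

    def sample(self, seed: int, max_tries: int = 1000) -> Tuple[str, int]:
        """Return one (question, label) pair, reproducible for a given seed."""
        rng = random.Random(seed)
        for _ in range(max_tries):
            params = self.generator(rng)
            label = self.verifier(params)
            if label is not None:  # keep only configurations the verifier accepts
                return self.template.format(**params), label
        raise RuntimeError("no valid configuration found")


def gen_rectangle(rng: random.Random) -> Params:
    # Toy generator (an assumption): draws side lengths, some of them invalid.
    return {"w": rng.randint(1, 12), "h": rng.randint(1, 12)}


def verify_area(p: Params) -> Optional[int]:
    # Toy verifier: rejects squares, otherwise computes the answer exactly.
    if p["w"] < 1 or p["h"] < 1 or p["w"] == p["h"]:
        return None
    return p["w"] * p["h"]


RECTANGLE_AREA = Spec(
    template="A rectangle that is not a square has width {w} and height {h}. "
             "What is its area?",
    generator=gen_rectangle,
    verifier=verify_area,
)

question, label = RECTANGLE_AREA.sample(seed=7)
```

In this sketch, sampling with a different seed gives a fresh instance, and sampling again with the same seed reproduces the same instance, which is one way the "fresh, verified instances on demand" idea could be made auditable.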

Paper Structure (116 sections, 6 equations, 6 figures, 15 tables, 1 algorithm)

Figures (6)

  • Figure 1: VeRA represents a benchmark as executable specifications. Each specification contains (i) a natural-language template, (ii) generator code that samples valid slot assignments, and (iii) verifier code that deterministically checks validity and computes the label. Sampling the specification is GPU/LLM-free and yields fresh instances with labels certified by executing the verifier. VeRA-E preserves the original task (subtraction) while changing surface form; VeRA-H defines a hardened task with an updated verifier.
  • Figure 2: Insights by VeRA. Two AIME seeds are solved correctly, yet the same models fail on VeRA-E variants that preserve the logical constraints but change surface form. This exposes a common failure mode: seed performance can be inflated by familiarity / surface heuristics, while verified-equivalent rewrites reveal brittleness.
  • Figure 3: VeRA difficulty stratification on AIME 2024-II-10. VeRA-E changes surface form and language while preserving the underlying logic. VeRA-H modifies the mathematics, here generalizing perpendicularity to arbitrary angles. VeRA-H Pro picks the single hardest variant from the VeRA-H pool via judge-model ranking, stress-testing the reasoning limits of frontier models.
  • Figure 4: VeRA pipeline. A seed item is compiled into executable specifications that can generate verified families.
  • Figure 5: End-to-end workflow for VeRA-E dataset curation.
  • ...and 1 more figure
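
The distinction drawn in the Figure 1 caption, where VeRA-E keeps the task (subtraction) and changes only the surface form while VeRA-H defines a hardened task with an updated verifier, can be illustrated with a toy sketch. The templates, the repeated-subtraction hardening, and all names below are assumptions made for illustration and are not taken from the paper.

```python
import random
from typing import Dict, Optional

Params = Dict[str, int]


def verify_subtract(p: Params) -> Optional[int]:
    """Shared by the seed and the equivalent rewrite: the task is unchanged."""
    if not (0 < p["b"] <= p["a"]):
        return None
    return p["a"] - p["b"]


def verify_hardened(p: Params) -> Optional[int]:
    """Hardened task (hypothetical): repeated subtraction needs its own verifier."""
    if p["b"] < 1 or p["k"] < 1 or p["k"] * p["b"] > p["a"]:
        return None
    return p["a"] - p["k"] * p["b"]


SEED_TEMPLATE = "Compute {a} - {b}."
# Equivalent-style rewrite: new wording, same slots, same verifier.
EQUIV_TEMPLATE = ("A tank holds {a} liters. After {b} liters drain out, "
                  "how many liters remain?")
# Hardened-style variant: an extra slot k changes the underlying computation.
HARD_TEMPLATE = ("A tank holds {a} liters. Each hour {b} liters drain out. "
                 "How many liters remain after {k} hours?")


def sample_params(rng: random.Random, hardened: bool = False) -> Params:
    """Draw slot values that satisfy the matching verifier by construction."""
    b = rng.randint(1, 20)
    k = rng.randint(2, 5) if hardened else 1
    a = k * b + rng.randint(0, 50)  # guarantees a >= k * b
    params = {"a": a, "b": b}
    if hardened:
        params["k"] = k
    return params


rng = random.Random(0)
easy = sample_params(rng)
hard = sample_params(rng, hardened=True)
seed_item = (SEED_TEMPLATE.format(**easy), verify_subtract(easy))
equiv_item = (EQUIV_TEMPLATE.format(**easy), verify_subtract(easy))
hard_item = (HARD_TEMPLATE.format(**hard), verify_hardened(hard))
```

In this sketch the seed and the equivalent rewrite carry the same label because they share a verifier, whereas the hardened variant's label is only trustworthy because its verifier was updated along with the task.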