PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering

Xiangfeng Wang; Hangyu Guo; Yanlin Lai; Mitt Huang; Liang Zhao; Chengyuan Yao; Yinmin Zhang; Qi Han; Xiaoxiao Ren; Chun Yuan; Tong Xu; Zheng Ge; Xiangyu Zhang; Daxin Jiang

TL;DR

PRIME addresses a key gap in verifiable reasoning: it is a Process-Outcome Alignment benchmark that evaluates not only final answers but also the logical derivations leading to them. It introduces an expert-annotated dataset of 2,530 STEM problems across 16 domains and 480 sub-disciplines, built with a five-stage data pipeline that enforces verifiability and difficulty. Empirically, process-aware verifiers outperform outcome-only baselines, and verifier accuracy on PRIME strongly predicts downstream RLVR gains, with a linear correlation of $R^2 > 0.92$ between verifier accuracy and policy improvement. The work uses these results to guide verifier selection for RLVR and to mitigate reward hacking, and it outlines limitations and directions for future scaling and specialized verifiers. Overall, PRIME offers a principled framework and practical guidance for reliable multi-step reasoning in STEM tasks.

Abstract

While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms primarily focus on the consistency between the final result and the ground truth, often neglecting potential errors in the derivation process. This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification in Mathematics and Engineering. Curated from a comprehensive collection of college-level STEM problems, PRIME comprises 2,530 high-difficulty samples through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation ($R^2 > 0.92$) between verifier accuracy on PRIME and RLVR training effectiveness, validating PRIME as a reliable predictor for verifier selection.
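The difference between outcome-only and process-aware rewards described in the abstract can be illustrated with a minimal sketch. The `Verdict` type, its field names, and the binary 0/1 reward values are assumptions made for illustration; the abstract does not specify the reward shape or the verifier interface used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical verifier output; the field names are illustrative only."""
    answer_correct: bool   # final answer matches the ground truth
    process_correct: bool  # derivation contains no detected flaw


def outcome_only_reward(v: Verdict) -> float:
    # Outcome-centric verification: only the final answer is checked.
    return 1.0 if v.answer_correct else 0.0


def process_aware_reward(v: Verdict) -> float:
    # Process-outcome alignment: reward only when the answer AND the
    # derivation are both judged correct (assumed binary reward).
    return 1.0 if (v.answer_correct and v.process_correct) else 0.0


# A "lucky guess": correct answer reached through a flawed derivation.
lucky_guess = Verdict(answer_correct=True, process_correct=False)
print(outcome_only_reward(lucky_guess))
print(process_aware_reward(lucky_guess))
```

Under these assumptions, the outcome-only reward is positive for the lucky guess, while the process-aware reward is zero, which is the failure case the benchmark is designed to expose.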

Paper Structure (27 sections, 2 equations, 9 figures, 4 tables)

Introduction
Related Works
Large Reasoning Models
Verifier and Verifier Evaluation
Benchmark Construction
STEM Data Collection and Filtering
Diverse Response Generation and Difficulty-Aware Selection
Fine-grained Expert Labeling and Evaluation Metrics
Analysis of Performance on Prime
Experimental Settings
Models
Evaluation Paradigm
Performance Analysis
General Performance Trends
Reasoning is Important for Verification
...and 12 more sections

Figures (9)

Figure 1: An illustration of a "lucky guess" where the model arrives at the correct answer via an incorrect derivation.
Figure 2: Overview of the Prime construction pipeline. The process comprises five stages: (a) Extensive STEM data collection with diversity control; (b) Two-stage automated filtering for verifiability and correctness; (c) Heterogeneous response generation using diverse LRMs; (d) Difficulty-aware filtering to select discriminative samples; (e) Fine-grained expert labeling focusing on process-outcome alignment.
Figure 3: Subject distribution of Prime. The inner ring represents the four major STEM disciplines, while the outer ring details the 16 fine-grained sub-domains.
Figure 4: Efficiency vs. Performance. Token usage and accuracy comparison. Red dashed line: open-source (below) vs. commercial (above) models.
Figure 5: Correlation between verifier performance and downstream policy improvement when training with that verifier. The x-axis represents the Overall Accuracy on our Prime benchmark, and the y-axis represents the average score on the benchmarks in the paper's RL results table.
...and 4 more figures
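
The kind of correlation summarized in Figure 5 (an $R^2$ between verifier accuracy and downstream score) can be computed with an ordinary least-squares line fit. The sketch below is a generic illustration, not the paper's analysis code; the function name `linear_fit_r2` is hypothetical, and the input values are synthetic placeholders, not results from the paper.

```python
def linear_fit_r2(xs, ys):
    """Fit y = slope * x + intercept by least squares and return
    (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic placeholder points (NOT data from the paper):
# x = verifier accuracy on the benchmark, y = average downstream score.
toy_accuracy = [0.0, 1.0, 2.0, 3.0]
toy_score = [1.0, 3.0, 5.0, 7.0]
slope, intercept, r2 = linear_fit_r2(toy_accuracy, toy_score)
print(slope, intercept, r2)
```

An $R^2$ close to 1 means that a straight line explains most of the variance in the downstream score, which is the sense in which verifier accuracy is described as a predictor for verifier selection.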
