xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

Ding Chen, Qingchen Yu, Pengyuan Wang, Mengting Hu, Wentao Zhang, Zhengren Wang, Bo Tang, Feiyu Xiong, Xinchi Li, Chao Wang, Minchuan Yang, Zhiyu Li

TL;DR

This work introduces xVerify, an efficient answer verifier tailored for evaluating reasoning-model outputs on objective questions, addressing the difficulty of extracting final answers from long reasoning traces. It combines a formal evaluation framework with a large, multi-source VAR dataset that includes diverse prompts, extensive annotations, and data augmentation to train robust judge models. Empirical results show that xVerify models achieve state-of-the-art accuracy and F1 on test and generalization sets across multiple question types, often surpassing existing evaluation frameworks and judge models while offering lower cost and faster inference than GPT-4o-based judging. The approach demonstrates strong generalization and practical applicability for scalable, automatic reasoning-model evaluation in real-world settings.

Abstract

With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.
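As a rough illustration of the two headline metrics, the sketch below scores a judge's verdicts against gold labels. It assumes binary labels ("correct" / "incorrect") and computes F1 on the "correct" class only; the function name `accuracy_and_f1` and the label strings are hypothetical, and the paper's exact averaging scheme may differ.

```python
def accuracy_and_f1(predicted, gold, positive="correct"):
    """Return (accuracy, F1) for a judge's verdicts against gold labels.

    Assumption: binary labels, with F1 computed on the `positive` class.
    """
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal in length")

    pairs = list(zip(predicted, gold))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)

    accuracy = sum(p == g for p, g in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1
```

Reporting both numbers matters when the label distribution is skewed: a judge that marks nearly everything "correct" can score high accuracy while its F1 exposes the missed cases.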

Paper Structure

This paper contains 34 sections, 6 equations, 19 figures, 22 tables.

Figures (19)

  • Figure 1: Framework of xVerify: (1) Collecting LLM Responses: aggregate responses from multiple LLMs across datasets covering four question types. (2) VAR Dataset Construction: employ GPT-4o and human annotators for labeling and rechecking, and use data augmentation to refine the dataset. (3) xVerify Judge Pipeline: accurately evaluate multi-component answers from reasoning models on challenging questions.
  • Figure 2: Data Augmentation Pipelines: (1) transformation of multiple-choice options through numbering conversion and noise injection, (2) diversification of mathematical answers via equivalent expression generation, and (3) final answer sentence transformation using prompt rephrasing, symbol wrapping, and gap token insertion.
  • Figure 3: Illustration of the Label Studio Interface.
  • Figure 4: Few-shot prompt for generating LLM responses.
  • Figure 5: Few-shot-restrict prompt for generating LLM responses.
  • ...and 14 more figures
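
Two of the augmentation steps named in the Figure 2 caption, option numbering conversion and symbol wrapping of the final answer, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' pipeline: the function names `relabel_options` and `wrap_answer`, the label styles, and the wrapper templates are all hypothetical.

```python
ROMAN = ["I", "II", "III", "IV", "V", "VI"]

def relabel_options(options, style="numeric"):
    """Map letter-labelled options (A, B, ...) to numeric or Roman labels.

    Assumption: `options` is a dict keyed by letters; at most six options
    are supported for the Roman style.
    """
    keys = sorted(options)
    if style == "numeric":
        labels = [str(i + 1) for i in range(len(keys))]
    elif style == "roman":
        if len(keys) > len(ROMAN):
            raise ValueError("too many options for the Roman style")
        labels = ROMAN[:len(keys)]
    else:
        raise ValueError(f"unknown style: {style}")
    return {label: options[key] for label, key in zip(labels, keys)}

def wrap_answer(answer, wrapper="boxed"):
    """Wrap a final answer in one of several hypothetical surface formats."""
    templates = {
        "boxed": "\\boxed{{{}}}",  # LaTeX-style box
        "bold": "**{}**",          # Markdown bold
        "brackets": "[{}]",        # plain brackets
    }
    return templates[wrapper].format(answer)
```

For example, `relabel_options({"A": "cat", "B": "dog"}, style="roman")` yields `{"I": "cat", "II": "dog"}`. Each variant keeps the gold label unchanged while altering only the surface form, which is the kind of variation a verifier is expected to see through.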