StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error

Shu-Xun Yang, Cunxiang Wang, Yidong Wang, Xiaotao Gu, Minlie Huang, Jie Tang

TL;DR

This work proposes a novel mathematical process evaluation agent based on Tree-of-Error, called StepMathAgent, which outperforms all state-of-the-art methods, demonstrating human-aligned evaluation preferences and broad applicability to various scenarios.

Abstract

Evaluating mathematical capabilities is critical for assessing the overall performance of large language models (LLMs). However, existing evaluation methods often focus solely on final answers, which yields inaccurate and uninterpretable evaluation outcomes and leaves proof and open-ended problems unassessed. To address these issues, we propose a novel mathematical process evaluation agent based on Tree-of-Error, called StepMathAgent. This agent incorporates four internal core operations: logical step segmentation, step scoring, score aggregation and error tree generation, along with four external extension modules: difficulty calibration, simplicity evaluation, completeness validation and format assessment. Furthermore, we introduce StepMathBench, a benchmark comprising 1,000 step-divided process evaluation instances, derived from 200 high-quality math problems grouped by problem type, subject category and difficulty level. Experiments on StepMathBench show that our proposed StepMathAgent outperforms all state-of-the-art methods, demonstrating human-aligned evaluation preferences and broad applicability to various scenarios. Our data and code are available at https://github.com/SHU-XUN/StepMathAgent.
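
To make the four core operations concrete, the following is a minimal illustrative sketch in Python and not the authors' implementation. Every name in it (`Step`, `segment`, `aggregate`, `build_error_tree`), the 0-to-1 score scale, the line-based segmentation, and the mean aggregation rule are assumptions introduced here for illustration; the abstract does not specify any of them. Step scores are supplied by hand, standing in for whatever scorer the agent actually uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    index: int
    text: str
    score: float                      # assumed scale: 1.0 = fully correct, 0.0 = wrong
    caused_by: Optional[int] = None   # assumed link to the earlier erroneous step, if any


def segment(solution: str) -> List[str]:
    """Hypothetical segmentation: one logical step per non-empty line."""
    return [line.strip() for line in solution.splitlines() if line.strip()]


def aggregate(steps: List[Step]) -> float:
    """Hypothetical aggregation: plain mean of the step scores."""
    return sum(s.score for s in steps) / len(steps) if steps else 0.0


def build_error_tree(steps: List[Step]) -> Dict[Optional[int], List[int]]:
    """Group imperfect steps under the step they derive from.

    Root errors (no earlier cause) are stored under the key None.
    """
    tree: Dict[Optional[int], List[int]] = {}
    for s in steps:
        if s.score < 1.0:
            tree.setdefault(s.caused_by, []).append(s.index)
    return tree


# Hand-scored toy solution: step 1 is a root error, step 2 inherits it.
texts = segment("x + 2 = 5\n\nx = 5 + 2\nx = 7\nCheck the result")
steps = [
    Step(0, texts[0], 1.0),
    Step(1, texts[1], 0.0),
    Step(2, texts[2], 0.5, caused_by=1),
    Step(3, texts[3], 1.0),
]
overall = aggregate(steps)
error_tree = build_error_tree(steps)
```

In this toy layout, the error tree separates the originating mistake from the steps that merely propagate it, which is the kind of interpretability that answer-only evaluation cannot provide.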

Paper Structure

This paper contains 29 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Critical issues arising from answer evaluation.
  • Figure 2: Distribution of solution scores.
  • Figure 3: The overall architecture of StepMathAgent.
  • Figure 4: Analysis of AvgS and step lengths.
  • Figure 5: Case Study on StepMathAgent. The Chinese text on the right represents the original version, while the corresponding English translation is presented on the left.