MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit

Boning Zhang; Chengxi Li; Kai Fan

TL;DR

MARIO Eval tackles inconsistent automatic evaluation for mathematical reasoning by introducing a two-stage toolkit that leverages a Python CAS for numerical accuracy and an optional LLM to disambiguate answer types and judge equivalence. It formalizes a type system aligned with Python/SymPy and a design pattern where rule-based classification is enhanced by LLM-assisted analysis, used only when necessary to curb hallucinations. Empirical results on MATH and GK2023 (including GK2023-ToRA) show near $97\%$ equivalence accuracy with the basic design and modest gains with LLM integration, with case studies demonstrating robustness to formatting, units, and set/interval representations. The work enables more reliable cross-dataset comparisons in mathematical reasoning and provides public datasets and code for reproducibility.

Abstract

Large language models (LLMs) have been explored on a variety of reasoning tasks, including mathematical problem solving. Each math dataset typically ships with its own purpose-built evaluation script which, while suitable for its intended use, does not generalize to other datasets. Consequently, updates and adaptations to these evaluation tools tend to occur without being systematically reported, leading to inconsistencies and obstructing fair comparison across studies. To bridge this gap, we introduce a comprehensive mathematical evaluation toolkit that not only uses a Python computer algebra system (CAS) for its numerical accuracy, but also integrates an optional LLM, known for its considerable natural language processing capabilities. To validate the effectiveness of our toolkit, we manually annotated two distinct datasets. Our experiments demonstrate that the toolkit yields more robust evaluation results than prior work, even without an LLM, and that incorporating an LLM brings a notable further improvement. The code for our method will be made available at \url{https://github.com/MARIO-Math-Reasoning/math_evaluation}.
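The two-stage idea summarized above (first decide the answer type, then check equivalence with a CAS, deferring to an optional judge only when the CAS cannot decide) can be illustrated with a minimal sketch. This is not the toolkit's actual API: the function names `classify`, `expr_equal`, and `is_equiv`, the three type labels, and the `llm_judge` callback are all hypothetical, and the sketch assumes answers are already plain SymPy-parsable strings rather than LaTeX.

```python
import sympy as sp


def classify(answer):
    """Rule-based guess of the answer type (hypothetical labels)."""
    text = answer.strip()
    if text.startswith("{") and text.endswith("}"):
        return "set"
    if text.startswith(("(", "[")) and text.endswith((")", "]")) and "," in text:
        return "interval_or_tuple"
    return "expression"


def split_items(text):
    """Split a bracketed answer on commas (no support for nested commas)."""
    return [part.strip() for part in text.strip()[1:-1].split(",")]


def expr_equal(a, b, tol=1e-9, llm_judge=None):
    """Compare two expression strings with SymPy.

    Note: sympify evaluates its input, so it should not be fed untrusted text.
    """
    try:
        diff = sp.simplify(sp.sympify(a) - sp.sympify(b))
        if diff.is_number:
            # Numeric difference: compare within a tolerance.
            return abs(complex(diff)) < tol
        return diff == 0
    except Exception:
        # The CAS could not parse or compare the inputs. Only now fall back
        # to the optional judge (assumed callable), else to string equality.
        if llm_judge is not None:
            return bool(llm_judge(a, b))
        return a.strip() == b.strip()


def is_equiv(pred, ref, llm_judge=None):
    """Stage 1: match answer types. Stage 2: type-specific comparison."""
    kind = classify(ref)
    if classify(pred) != kind:
        return False
    if kind == "expression":
        return expr_equal(pred, ref, llm_judge=llm_judge)
    p_items, r_items = split_items(pred), split_items(ref)
    if len(p_items) != len(r_items):
        return False
    if kind == "set":
        # Order-insensitive; a simple (not strictly one-to-one) matching.
        return all(
            any(expr_equal(x, y, llm_judge=llm_judge) for y in r_items)
            for x in p_items
        )
    # Intervals and tuples: bracket style and element order both matter.
    p, r = pred.strip(), ref.strip()
    same_brackets = (p[0], p[-1]) == (r[0], r[-1])
    return same_brackets and all(
        expr_equal(x, y, llm_judge=llm_judge) for x, y in zip(p_items, r_items)
    )
```

Under these assumptions, `is_equiv("1/2", "0.5")` and `is_equiv("{1, 2}", "{2, 1}")` are accepted, while `is_equiv("(1, 2]", "(1, 2)")` is rejected because the bracket style differs. Keeping the judge behind the CAS failure path mirrors the stated design goal of invoking the LLM only when necessary.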

Paper Structure (13 sections, 1 equation, 2 figures, 4 tables, 1 algorithm)

Introduction
Main Framework
Type Definitions
Design Pattern
Datasets and Setups
Main Results
Ablation Studies
Related Works
Conclusion
Appendix
Case Study on our MATH
Case Study on our GAOKAO
Case Study on MATH of ToRA

Figures (2)

Figure 1: Most previous evaluation tools judge correctness based solely on the answer, while ours also takes into account the answer type implied by the question.
Figure 2: Solution accuracy with different toolkits. Left: MATH. Right: GK2023.
