
FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models

Yan Liu, Renren Jin, Ling Shi, Zheng Yao, Deyi Xiong

TL;DR

FineMath is a fine-grained, category-based benchmark for evaluating Chinese LLMs on elementary school mathematics. It collects 1,584 math word problems, sorts them into 17 categories, annotates the number of reasoning steps each problem requires, and converts the problems into multiple-choice questions (MCQs) so that generation-based and selection-based evaluation can be compared. The paper covers data collection, annotation, contamination analysis against Ape210K, and the effect of prompts and evaluation methods, and it finds that both the prompt and the task form significantly affect results. GPT-4 achieves the top performance, while several Chinese LLMs leave considerable room for improvement. The findings also point to the need for contamination filtering and generation-based evaluation for fair, robust assessment. Overall, FineMath serves as a practical benchmark for diagnosing conceptual understanding and multi-step reasoning in Chinese LLMs.

Abstract

To thoroughly assess the mathematical reasoning abilities of Large Language Models (LLMs), we need to carefully curate evaluation datasets covering diverse mathematical concepts and mathematical problems at different difficulty levels. In pursuit of this objective, we propose FineMath in this paper, a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. FineMath is created to cover the key mathematical concepts taught in elementary school math, which are further divided into 17 categories of math word problems, enabling in-depth analysis of the mathematical reasoning abilities of LLMs. All 17 categories of math word problems are manually annotated with difficulty levels according to the number of reasoning steps required to solve them. We conduct extensive experiments on a wide range of LLMs on FineMath and find that there is still considerable room for improvement in the mathematical reasoning capability of Chinese LLMs. We also carry out an in-depth analysis of the evaluation process and evaluation methods, which have been overlooked previously. These two factors significantly influence the model results and our understanding of their mathematical reasoning capabilities. The dataset will be publicly available soon.

Paper Structure (22 sections, 4 figures, 10 tables)

This paper contains 22 sections, 4 figures, 10 tables.

Figures (4)

  • Figure 1: FineMath can evaluate LLMs' mathematical ability from three aspects: accuracy of understanding abstract mathematical concepts, accuracy of reasoning, and overall accuracy.
  • Figure 2: Contamination analysis. The overlap rate between FineMath and the training set of Ape210K.
  • Figure 3: Main results of different evaluated LLMs on our dataset (under Prompt 0).
  • Figure 4: Results in terms of the number of mathematical reasoning steps (under Prompt 0).
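
The breakdowns named in Figures 1 and 4 (accuracy per concept category, accuracy per number of reasoning steps, and overall accuracy) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record layout, the field names (category, steps, correct), and the sample values are all hypothetical, chosen only to show how graded answers could be aggregated along the annotated dimensions.

```python
from collections import defaultdict


def accuracy_by_key(records, key):
    """Group graded records by `key` and return the accuracy of each group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        correct[r[key]] += int(r["correct"])
    return {k: correct[k] / totals[k] for k in totals}


# Hypothetical graded model outputs; names and values are illustrative only
# and do not come from the FineMath dataset.
records = [
    {"category": "percent", "steps": 1, "correct": True},
    {"category": "percent", "steps": 2, "correct": False},
    {"category": "fraction", "steps": 1, "correct": True},
    {"category": "fraction", "steps": 3, "correct": False},
]

per_category = accuracy_by_key(records, "category")  # concept-level view
per_steps = accuracy_by_key(records, "steps")        # reasoning-depth view
overall = sum(r["correct"] for r in records) / len(records)

print(per_category, per_steps, overall)
```

Keeping the category and step annotations on every record means the same graded outputs can be sliced along either dimension without re-running the model.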