The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?

Bianca Raimondi, Francesco Pivi, Davide Evangelista, Maurizio Gabbrielli

TL;DR

CompMath-MCQ is a new benchmark dataset for assessing LLMs on advanced mathematical reasoning. Its multiple-choice format enables objective, reproducible, and bias-free evaluation through the lm_eval library.

Abstract

The evaluation of Large Language Models (LLMs) on mathematical reasoning has largely focused on elementary problems, competition-style questions, or formal theorem proving, leaving graduate-level and computational mathematics relatively underexplored. We introduce CompMath-MCQ, a new benchmark dataset for assessing LLMs on advanced mathematical reasoning in a multiple-choice setting. The dataset consists of 1,500 original questions authored by professors of graduate-level courses, covering topics including Linear Algebra, Numerical Optimization, Vector Calculus, Probability, and Python-based scientific computing. Each question offers three answer options, exactly one of which is correct. To prevent data leakage, all questions are newly created and not sourced from existing materials. The validity of the questions is verified through a procedure based on cross-LLM disagreement, followed by manual expert review. By adopting a multiple-choice format, our dataset enables objective, reproducible, and bias-free evaluation through the lm_eval library. Baseline results with state-of-the-art LLMs indicate that advanced computational mathematical reasoning remains a significant challenge. We release CompMath-MCQ at the following link: https://github.com/biancaraimondi/CompMath-MCQ.git

Paper Structure (11 sections, 6 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Distribution of questions across mathematical topics in the CompMath-MCQ dataset. The dataset contains 1,500 questions spanning five core areas.
  • Figure 2: Distribution (a) and empirical cumulative distribution function (b) of the per-question error rate, defined as the fraction of models predicting an incorrect answer.
  • Figure 3: (a) Per-question error rate versus wrong-answer consensus. Points in the upper-right region indicate questions for which multiple models fail in a consistent manner, signaling potential ambiguity or mislabeling. (b) Error rate versus wrong-answer consensus with statistically anomalous questions highlighted (binomial test, $p<0.01$). The darker the spot, the more overlapping questions there are in that area.
  • Figure 4: Comparison of LLM accuracy across categories on CompMath-MCQ.
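
The quantities named in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. The error rate follows the caption's definition (the fraction of models predicting an incorrect answer). Everything else here is an assumption made for illustration, not the authors' procedure: wrong-answer consensus is taken to be the share of wrong predictions that land on the most common wrong option, and the binomial test uses a null hypothesis in which wrong answers split evenly between the two distractors. Applying a two-sided test to the largest distractor count is a simplification. All function names are hypothetical.

```python
from collections import Counter
from math import comb


def error_rate(predictions, correct):
    """Fraction of models whose predicted option differs from the correct one."""
    wrong = [p for p in predictions if p != correct]
    return len(wrong) / len(predictions)


def wrong_answer_consensus(predictions, correct):
    """Share of the wrong predictions that fall on the most common wrong option.

    Assumed definition (not taken from the paper); returns 0.0 if no model is wrong.
    """
    wrong = [p for p in predictions if p != correct]
    if not wrong:
        return 0.0
    top_count = Counter(wrong).most_common(1)[0][1]
    return top_count / len(wrong)


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


def is_anomalous(predictions, correct, alpha=0.01):
    """Flag a question when wrong answers concentrate on one distractor more
    than an even split over the two distractors would explain (assumed null)."""
    wrong = [p for p in predictions if p != correct]
    if not wrong:
        return False
    top_count = Counter(wrong).most_common(1)[0][1]
    return binomial_two_sided_p(top_count, len(wrong)) < alpha


# Hypothetical predictions from ten models on one three-option question
# whose labeled answer is "A": nine models agree on distractor "B".
predictions = ["B"] * 9 + ["A"]
print(error_rate(predictions, "A"))
print(wrong_answer_consensus(predictions, "A"))
print(is_anomalous(predictions, "A"))
```

A question with both a high error rate and a high consensus would fall in the upper-right region described for Figure 3(a), which the caption associates with potential ambiguity or mislabeling.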