Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models

Eldar Kurtic, Amir Moeini, Dan Alistarh

TL;DR

This work introduces Mathador-LM, a new benchmark for evaluating mathematical reasoning in large language models (LLMs) that combines ruleset interpretation, planning, and problem-solving. It shows that average performance remains stable across leading LLMs even though benchmark instances are generated dynamically, following a target difficulty level.

Abstract

We introduce Mathador-LM, a new benchmark for evaluating mathematical reasoning in large language models (LLMs), combining ruleset interpretation, planning, and problem-solving. This benchmark is inspired by the Mathador game, where the objective is to reach a target number using basic arithmetic operations on a given set of base numbers, following a simple set of rules. We show that, across leading LLMs, we obtain stable average performance while generating benchmark instances dynamically, following a target difficulty level. Thus, our benchmark alleviates concerns about test-set leakage into training data, an issue that often undermines popular benchmarks. Additionally, we conduct a comprehensive evaluation of both open and closed-source state-of-the-art LLMs on Mathador-LM. Our findings reveal that contemporary models struggle with Mathador-LM, scoring significantly lower than average 3rd graders. This stands in stark contrast to their strong performance on popular mathematical reasoning benchmarks. The implementation of the Mathador-LM benchmark is available at https://github.com/IST-DASLab/Mathador-LM.
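
To make the task concrete, the following is a minimal sketch of the kind of problem described above: a brute-force search for an expression that reaches a target from a set of base numbers, plus a generator that samples fresh solvable instances. It is not the authors' implementation. The function names (`solve`, `generate_instance`), the number ranges, and the rules that each base number is used at most once and that every intermediate result is a non-negative integer are assumptions made for illustration. The sketch also does not model the target difficulty level mentioned in the abstract.

```python
import random
from itertools import combinations


def solve(numbers, target):
    """Return an expression string that evaluates to `target`, or None.

    Assumed rules (not taken from the paper): each base number is used at
    most once, and every intermediate result is a non-negative integer.
    """
    items = [(n, str(n)) for n in numbers]
    return _search(items, target)


def _search(items, target):
    # Any value currently available may already be the target.
    for value, expr in items:
        if value == target:
            return expr
    # Otherwise combine two available values and recurse on the smaller set.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        if a < b:  # put the larger operand first so a - b is non-negative
            (a, ea), (b, eb) = (b, eb), (a, ea)
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
        ]
        if b != 0 and a % b == 0:  # keep only exact integer divisions
            candidates.append((a // b, f"({ea} / {eb})"))
        for candidate in candidates:
            found = _search(rest + [candidate], target)
            if found is not None:
                return found
    return None


def generate_instance(rng, n_numbers=5, max_base=20, max_target=99):
    """Sample a random solvable instance (all ranges are hypothetical)."""
    while True:
        numbers = [rng.randint(1, max_base) for _ in range(n_numbers)]
        target = rng.randint(1, max_target)
        if solve(numbers, target) is not None:
            return numbers, target


if __name__ == "__main__":
    numbers, target = generate_instance(random.Random(0))
    print(numbers, target, solve(numbers, target))
```

Because instances are sampled at evaluation time rather than drawn from a fixed file, a generator of this shape illustrates why a dynamic benchmark is harder to leak into training data: each run can use a new seed and therefore a new set of problems.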

Paper Structure (13 sections, 1 equation, 5 figures, 6 tables)

This paper contains 13 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparative results on Mathador-LM, MMLU, and GSM8K, across the Llama-3-Instruct (8B and 70B), Phi-3-Instruct (small and medium), and Qwen2-Instruct model families. Interpolation lines show very high scores and clear saturation on MMLU and GSM8K, at or beyond the level of specialized humans, whereas on Mathador-LM contemporary models score significantly below the average 3rd grader. MMLU and GSM8K results are obtained from open-llm-leaderboard, mmlu, and qwen.
  • Figure 2: The prompt for the Mathador-LM benchmark.
  • Figure 3: An example problem demonstrating both simple and best (Mathador) solutions.
  • Figure 4: Detailed results on Mathador-LM across open and closed models, including confidence intervals. Experiments performed in June 2024.
  • Figure 5: Distribution of scores for several models, showing a low correlation between higher overall performance and the number of high-scoring solutions.