Benchmarking Large Language Models for Math Reasoning Tasks

Kathrin Seßler; Yao Rong; Emek Gözlüklü; Enkelejda Kasneci

TL;DR

This benchmarking study addresses the challenge of selecting LLMs for mathematical reasoning by evaluating seven prompting strategies across five math datasets on four foundation models, with a focus on accuracy, robustness, and efficiency. It finds that larger models such as GPT-4o and LLaMA 3-70B achieve strong performance on many tasks regardless of the prompting strategy, while smaller models benefit more from tailored prompting such as Auto CoT; the impact of a prompting strategy varies by model and dataset. The work highlights trade-offs between cost, time, and accuracy, showing that Auto CoT often delivers a favorable balance for open models, while GPT-3.5 with Zero-Shot CoT can be cost-effective for simpler tasks. By open-sourcing the benchmarking code and providing a multi-faceted evaluation framework, the paper offers practical guidance for deploying LLMs in educational and professional math-reasoning settings and establishes a platform for future cross-model benchmarking in this area.

Abstract

The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance, such as in educational settings. Despite the variety of datasets and in-context learning algorithms designed to improve the ability of LLMs to automate mathematical problem solving, the lack of comprehensive benchmarking across different datasets makes it difficult to select an appropriate model for specific tasks. In this project, we present a benchmark that fairly compares seven state-of-the-art in-context learning algorithms for mathematical problem solving across five widely used mathematical datasets on four powerful foundation models. Furthermore, we explore the trade-off between efficiency and performance, highlighting the practical applications of LLMs for mathematical reasoning. Our results indicate that larger foundation models like GPT-4o and LLaMA 3-70B can solve mathematical reasoning tasks largely independently of the specific prompting strategy, while for smaller models the in-context learning approach significantly influences performance. Moreover, the optimal prompt depends on the chosen foundation model. We open-source our benchmark code to support the integration of additional models in future research.

Paper Structure (18 sections, 21 equations, 33 figures, 3 tables)

Introduction
Related Work
Benchmarking Details
Datasets for Mathematical Reasoning
Foundation Models
Methods for Mathematical Reasoning
Performance Metrics
Experimental Results
Experimental setup
Robust Performance
Foundation Models.
Prompt Strategies.
Datasets.
Efficiency
Proficiency
...and 3 more sections

Figures (33)

Figure 1: Mathematical reasoning task from the MATH dataset. The question belongs to the Algebra category at level 1, and the predicted answers were generated by the different foundation models using the CoT approach.
Figure 2: Overview of the mathematical reasoning methods evaluated in the benchmark, categorized into three groups: Prompt Engineering, Process Optimization, and External Engine. The primary procedure for each method is outlined. Symbols denote the presence of few-shot examples, the use of an external engine, and the number of refinement iterations required.
Figure 3: Trade-off between performance and computational cost on the GSM8K dataset. The y-axis shows pass@3 and the x-axis the computational cost. On the left, the LLaMA foundation models are compared based on elapsed computation time; on the right, the GPT foundation models are compared based on the cost of the API calls.
Figure 4: Detailed analysis of the pass@3 metric using CoT, broken down by question category and level in the MATH dataset. The left side shows the LLaMA 3-8B results, the right side the LLaMA 3-70B results.
Figure 5: Further examples from the MATH dataset. The question belongs to the Algebra category at level 5, and the predicted answers were generated by LLaMA 3-8B and LLaMA 3-70B using the CoT approach.
...and 28 more figures
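
The captions for Figures 3 and 4 report a pass@3 metric without defining it on this page. As a minimal sketch, assuming the common reading that a question counts as solved when at least one of its first k generated answers is correct, the metric could be computed as follows. The function name `pass_at_k` and the sample data are hypothetical and are not taken from the paper's benchmark code.

```python
from typing import Sequence


def pass_at_k(correct_flags: Sequence[Sequence[bool]], k: int = 3) -> float:
    """Fraction of questions with at least one correct answer in the first k attempts.

    Assumption: this is the simple "any of k attempts" definition, which may
    differ from the exact definition used by the authors.
    """
    if not correct_flags:
        raise ValueError("need at least one question")
    # A question is solved if any of its first k attempts is marked correct.
    solved = sum(1 for attempts in correct_flags if any(attempts[:k]))
    return solved / len(correct_flags)


# Hypothetical correctness flags: one inner list per question, one flag per attempt.
results = [
    [False, True, False],   # solved on the second attempt
    [False, False, False],  # never solved
    [True, True, True],     # solved on every attempt
    [False, False, True],   # solved on the third attempt
]

score_at_3 = pass_at_k(results, k=3)
score_at_1 = pass_at_k(results, k=1)
print(f"pass@3 = {score_at_3:.2f}, pass@1 = {score_at_1:.2f}")
```

Comparing pass@1 with pass@3 on the same set of attempts gives a rough sense of how much repeated sampling helps a given model and prompting strategy.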
