HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics

Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael P. Brenner

TL;DR

HARDMath is a large, algorithmically generated benchmark of graduate-level applied mathematics problems that emphasizes asymptotic and approximation methods. It spans four problem classes comprising seven problem types, plus 40 word problems, and validates solutions against numerical ground truths, which allows scalable evaluation of LLMs’ mathematical reasoning and tool use. On the 366-problem HARDMath-mini test set, even top models score well below their results on existing mathematics benchmarks, which underscores the benchmark’s difficulty and the need for better reasoning and external-tool integration. The dataset’s automatic generation, ground-truth validation, and context-rich word problems offer a practical framework for advancing LLM capabilities in analytical approximation and multi-modal problem solving.

Abstract

Advanced applied mathematics problems are underrepresented in existing Large Language Model (LLM) benchmark datasets. To address this, we introduce HARDMath, a dataset inspired by a graduate course on asymptotic methods, featuring challenging applied mathematics problems that require analytical approximation techniques. These problems demand a combination of mathematical reasoning, computational tools, and subjective judgment, making them difficult for LLMs. Our framework auto-generates a large number of problems with solutions validated against numerical ground truths. We evaluate both open- and closed-source LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models demonstrate significantly lower performance compared to results on existing mathematics benchmark datasets. We additionally conduct a detailed error analysis to gain insights into the failure cases of LLMs. These results demonstrate limitations of current LLM performance on advanced graduate-level applied math problems and underscore the importance of datasets like HARDMath to advance mathematical abilities of LLMs.

Paper Structure

This paper contains 45 sections, 109 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Breakdowns of the HARDMath-mini (left) and the HARDMath (right) datasets.
  • Figure 2: Flowchart detailing the data generation procedure for HARDMath problems.
  • Figure 3: Percentage of correct, partial, and incorrect responses for o1-mini, GPT-4, and Llama3 across prompting conditions and problem types.
  • Figure 4: GPT-4 error modes for the problem type Roots under 0-shot vs. 5-shot CoT prompting.
  • Figure 5: Visual comparison of numerical and approximate analytical solutions to a sample Laplace integral problem for solution verification.
  • ...and 4 more figures
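
The kind of solution verification described for Figure 5, comparing an approximate analytical solution of a Laplace integral against a numerical one, can be illustrated with a minimal sketch. This is not the authors' pipeline: the integrand cos(t)·exp(-x t²) on [0, 1], the function names, and the use of Simpson's rule are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import math


def integrand(t, x):
    # Hypothetical example integrand (not taken from the dataset):
    # cos(t) * exp(-x * t^2), dominated by the endpoint t = 0 for large x.
    return math.cos(t) * math.exp(-x * t * t)


def simpson(f, a, b, n=20000):
    """Composite Simpson's rule on [a, b] with n subintervals (n made even)."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Odd interior nodes get weight 4, even interior nodes weight 2.
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def laplace_leading_order(x):
    # Leading-order Laplace approximation about the endpoint maximum t = 0:
    # I(x) ~ cos(0) * (1/2) * sqrt(pi / x) as x -> infinity.
    return 0.5 * math.sqrt(math.pi / x)


def relative_error(x):
    # Treat the numerical quadrature as the ground truth and measure how far
    # the analytical approximation is from it.
    numeric = simpson(lambda t: integrand(t, x), 0.0, 1.0)
    return abs(laplace_leading_order(x) - numeric) / abs(numeric)


if __name__ == "__main__":
    for x in (10.0, 100.0, 1000.0):
        print(f"x = {x:7.1f}  relative error = {relative_error(x):.2e}")
```

For this particular integrand the relative error of the leading-order term shrinks roughly like 1/(4x), so the approximation passes a fixed tolerance only once x is large enough. A check of this form is one plausible way to accept or reject a generated solution.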