HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics

Jingxuan Fan; Sarah Martinson; Erik Y. Wang; Kaylie Hausknecht; Jonah Brenner; Danxian Liu; Nianli Peng; Corey Wang; Michael P. Brenner

TL;DR

HARDMath is a large, algorithmically generated benchmark of graduate-level applied mathematics problems that emphasizes asymptotic and approximation methods. It spans four problem classes comprising seven problem types, adds 40 word problems set in applied science contexts, and validates solutions against numerical ground truths, which enables scalable evaluation of LLMs’ mathematical reasoning and tool use. On HARDMath-mini, a 366-problem test set, even top models score well below their results on existing mathematics benchmarks, which highlights the benchmark’s difficulty and the need for better reasoning and external-tool integration. The dataset’s automatic generation, ground-truth validation, and context-rich word problems offer a practical framework for advancing LLM capabilities in analytical approximation and multi-modal problem solving.

Abstract

Advanced applied mathematics problems are underrepresented in existing Large Language Model (LLM) benchmark datasets. To address this, we introduce HARDMath, a dataset inspired by a graduate course on asymptotic methods, featuring challenging applied mathematics problems that require analytical approximation techniques. These problems demand a combination of mathematical reasoning, computational tools, and subjective judgment, making them difficult for LLMs. Our framework auto-generates a large number of problems with solutions validated against numerical ground truths. We evaluate both open- and closed-source LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models demonstrate significantly lower performance compared to results on existing mathematics benchmark datasets. We additionally conduct a detailed error analysis to gain insights into the failure cases of LLMs. These results demonstrate limitations of current LLM performance on advanced graduate-level applied math problems and underscore the importance of datasets like HARDMath to advance mathematical abilities of LLMs.
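To make the idea of validating an analytical approximation against a numerical ground truth concrete, here is a minimal sketch. It is an illustration only, not the authors' pipeline: the integral, the function names (`numerical_ground_truth`, `leading_order_approx`, `is_validated`), and the 1% tolerance are all assumptions chosen for the example. It compares the standard leading-order large-x approximation of I(x) = ∫₀¹ exp(-x t²) dt, namely (1/2)·sqrt(π/x), with a Simpson's-rule evaluation of the same integral.

```python
import math


def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weights 4 (odd) and 2 (even).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def numerical_ground_truth(x):
    """Numerically evaluate I(x) = integral of exp(-x t^2) over [0, 1]."""
    return simpson(lambda t: math.exp(-x * t * t), 0.0, 1.0)


def leading_order_approx(x):
    """Leading-order large-x approximation: (1/2) * sqrt(pi / x)."""
    return 0.5 * math.sqrt(math.pi / x)


def relative_error(x):
    """Relative error of the approximation with respect to the numerical value."""
    truth = numerical_ground_truth(x)
    return abs(leading_order_approx(x) - truth) / truth


def is_validated(x, tol=1e-2):
    """Hypothetical acceptance check: approximation within tol of ground truth."""
    return relative_error(x) < tol


if __name__ == "__main__":
    for x in (1.0, 50.0):
        print(f"x = {x}: relative error = {relative_error(x):.3e}, "
              f"validated = {is_validated(x)}")
```

In this sketch the approximation passes the check in the regime where it is meant to hold (large x) and fails at x = 1, where the neglected correction is not small. A check of this kind is one way a generated solution could be screened automatically; the tolerance and evaluation points would need to be chosen per problem type.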
