HARP: A challenging human-annotated math reasoning benchmark

Albert S. Yue, Lovish Madaan, Ted Moskovitz, DJ Strouse, Aaditya K. Singh

TL;DR

HARP introduces a comprehensive, human-annotated math reasoning benchmark drawn from major AoPS contests, totaling $5{,}409$ problems with a rich annotation suite (difficulty, subject, multiple solutions, and choices). It provides two primary evaluation splits: a default short-answer set of $4{,}780$ automatically checkable items and a separate multiple-choice set of $4{,}110$ problems, plus open-source scraping and evaluation code. Across ten frontier models from five families, HARP reveals persistent gaps on the hardest problems. Notable findings are that chain-of-thought length correlates with problem difficulty and that multiple-choice prompts can yield higher accuracy than short-answer prompts. The dataset's design, which includes multiple human solutions, robust answer checking via SymPy, and scrambled-choice analyses, offers a versatile platform for diagnosing math reasoning and guiding future improvements in model-based problem solving. Overall, HARP exposes substantial room for progress in high-difficulty math reasoning and provides a public, extensible resource for ongoing research and benchmarking.

Abstract

Math reasoning is an ever-increasing area of focus as we scale large language models. However, even the previously toughest evals like MATH are now close to saturated by frontier models (90.0% for o1-mini and 86.5% for Gemini 1.5 Pro). We introduce HARP, Human Annotated Reasoning Problems (for Math), consisting of 5,409 problems from the US national math competitions (A(J)HSME, AMC, AIME, USA(J)MO). Of these, 4,780 have answers that are automatically checkable (with libraries such as SymPy). These problems span six difficulty levels, with frontier models performing relatively poorly on the hardest bracket of 197 problems (average accuracy 41.1% for o1-mini, and 9.6% for Gemini 1.5 Pro). Our dataset also features multiple choices (for 4,110 problems) and an average of two human-written, ground-truth solutions per problem, offering new avenues of research that we explore briefly. We report evaluations for many frontier models and share some interesting analyses, such as demonstrating that frontier models across families intrinsically scale their inference-time compute for more difficult problems. Finally, we open source all code used for dataset construction (including scraping) and all code for evaluation (including answer checking) to enable future research at: https://github.com/aadityasingh/HARP.
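
To make "automatically checkable" concrete, here is a minimal sketch of a short-answer equivalence check. It is not the HARP checker: the abstract says the authors rely on libraries such as SymPy, whereas this sketch swaps in Python's standard-library `fractions.Fraction` so that it runs without dependencies. The function names `normalize` and `answers_match` are hypothetical.

```python
from fractions import Fraction


def normalize(answer: str) -> str:
    """Strip surrounding whitespace and dollar signs, then drop inner spaces."""
    return answer.strip().strip("$").replace(" ", "")


def answers_match(predicted: str, reference: str) -> bool:
    """Compare two short answers.

    Numeric answers are compared as exact rationals, so "1/2" matches "0.5".
    Anything Fraction cannot parse falls back to normalized string equality.
    """
    p, r = normalize(predicted), normalize(reference)
    try:
        return Fraction(p) == Fraction(r)
    except (ValueError, ZeroDivisionError):
        return p == r
```

The string fallback treats `x+1` and `1+x` as different answers. Handling that kind of symbolic equivalence is the reason a computer algebra system would be needed in practice.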

Paper Structure

This paper contains 38 sections, 17 figures, 9 tables.

Figures (17)

  • Figure 1: Accuracy of various LLMs on MATH and a) our full dataset or b) our dataset restricted to the two highest difficulties. See Table \ref{tab:overall-shortans} for numerical accuracies. We can see that improvement on MATH does not correspond to increases on the highest difficulties of HARP for most models, indicating possible overfitting to easier problems from MATH. c) An example problem with annotations, choices, and multiple solutions from our dataset. None of the 10 models we evaluated answered this problem correctly.
  • Figure 2: Dataset summary. Top row shows the breakdown of the 5,409 questions that we scraped from A(J)HSME, AMC, AIME, and USA(J)MO contests. The pie plots in the bottom row indicate the breakdown of the 4,780 short answer questions (the "default split" of HARP) according to difficulty level, subject, and # of human-written ground-truth solutions.
  • Figure 3: Per-difficulty (left) and per-subject (right) accuracy of various LLMs on HARP. See Table \ref{tab:overall-shortans} for numerical accuracies.
  • Figure 4: Distribution of number of output tokens categorized by level. The final column shows the distribution of human-written solutions. We use the GPT-4o tokenizer (via tiktoken) to compute the number of tokens in human-written solutions.
  • Figure 5: $\text{Pass}@k$ and $\text{maj}@k$ performance across various values of $k$ on Gemini 1.5 Pro using $\text{temperature} = 1$ and $\text{top\_p} = 0.95$. Error bars on $\text{maj}@k$ indicate 95% confidence intervals, calculated over 5 re-orderings of samples.
  • ...and 12 more figures
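
The $\text{Pass}@k$ and $\text{maj}@k$ metrics named in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: `pass_at_k` uses the commonly used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct, and `maj_at_k` takes a plurality vote over the first $k$ samples. Both function names are hypothetical, and answers are assumed to be already normalized strings.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples drawn, c: number of correct samples, k: budget.
    Returns the probability that at least one of k samples, drawn without
    replacement from the n available, is correct.
    """
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def maj_at_k(answers: list[str], reference: str, k: int) -> bool:
    """maj@k for one problem: does the most common answer among the
    first k samples equal the reference answer?"""
    most_common, _ = Counter(answers[:k]).most_common(1)[0]
    return most_common == reference
```

Because `maj_at_k` depends on which samples come first, shuffling `answers` and recomputing gives a spread of values. That is one plausible reading of the caption's confidence intervals "calculated over 5 re-orderings of samples".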