Towards Robust Mathematical Reasoning

Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, Junehyuk Jung

TL;DR

This work introduces IMO-Bench, a comprehensive suite for evaluating robust mathematical reasoning at the IMO level through three tasks: IMO-AnswerBench for short answers, IMO-ProofBench for full proofs, and IMO-GradingBench for automatic proof grading. It details rigorous problem selection and robustification, and develops AnswerAutoGrader and ProofAutoGrader to enable scalable, automated evaluation that correlates highly with human judgments. Empirical results with Gemini Deep Think variants demonstrate strong performance on short-answer tasks and proof-writing, while highlighting gaps in advanced reasoning and the challenges of automatic long-form evaluation. The authors release IMO-Bench to the research community to advance robust reasoning, while candidly discussing limitations such as evaluation cost and potential data contamination.

Abstract

Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or focus only on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks that is vetted by a panel of top specialists and specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-ProofBench is the next-level evaluation for proof-writing capabilities; it includes both basic and advanced IMO-level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-ProofBench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively. We also show that autograders built with Gemini reasoning correlate well with human evaluations, and we construct IMO-GradingBench, with 1000 human gradings on proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning, and we release it at https://imobench.github.io/.

Paper Structure

This paper contains 47 sections, 4 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: IMO-ProofBench, a benchmark in IMO-Bench, for measuring proof-writing capabilities. We demonstrated high correlations between human and automatic evaluations on a variety of public models, including our IMO-gold model. See the IMO-ProofBench section (sec:proofbench) and the autograder subsection (subsec:ipb-autograder) for more details.
  • Figure 2: Topic distribution by category in IMO-AnswerBench. Number Theory and Combinatorics have the most topics, which reflects the broad knowledge required to solve these problems, while Geometry is skewed towards angle and side-length computation problems due to the short-answer nature of the benchmark.
  • Figure 3: Grade distribution for solutions in IMO-GradingBench by difficulty levels (IMO-Hard, IMO-Medium, IMO-Easy).
  • Figure 4: Correlation between ProofAutoGrader and human experts on the advanced IMO-ProofBench, evaluated over 170 internal models on our IMO-gold journey.
  • Figure 5: Confusion matrix for ProofAutoGrader vs. human expert grades, over 840 solutions generated by 14 public models (see the table of manual IMO-ProofBench results, tab:imo-proof-bench-manual-result).
  • ...and 1 more figure
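
Figures 4 and 5 compare ProofAutoGrader with human experts through a correlation and a confusion matrix. The sketch below shows one way such agreement statistics could be computed from paired grades. It is an illustration, not the authors' evaluation code: the function names, the choice of the Pearson coefficient, and the toy grades are all assumptions introduced here.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numeric scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def confusion_matrix(human, auto, labels):
    """Rows are human grades, columns are autograder grades, in `labels` order."""
    if len(human) != len(auto):
        raise ValueError("need one autograder grade per human grade")
    counts = Counter(zip(human, auto))
    return [[counts[(h, a)] for a in labels] for h in labels]


if __name__ == "__main__":
    # Hypothetical paired grades on a 0-7 scale (illustrative values only,
    # not data from the paper).
    human_grades = [7, 0, 1, 7, 6, 0]
    auto_grades = [7, 0, 0, 7, 7, 1]
    print("Pearson r:", round(pearson(human_grades, auto_grades), 3))
    for row in confusion_matrix(human_grades, auto_grades, [0, 1, 6, 7]):
        print(row)
```

Under the figure's own numbers, 840 solutions from 14 models works out to 840 / 14 = 60 graded solutions per model, so each cell of such a matrix would count solution-level agreements pooled across models.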