Putnam-like dataset summary: LLMs as mathematical competition contestants

Bartosz Bieganowski, Daniel Strzelecki, Robert Skiba, Mateusz Topolewski

TL;DR

The paper assesses LLMs on a Putnam-like benchmark of 96 problems, with solutions graded by human experts on a $0$–$10$ rubric. The analysis breaks performance down by problem level, category, and model, with cross-model comparisons and significance testing. Gemini-2.5-pro-03-25 and Gemini-2.5-flash-04-17 deliver the most complete, justified proofs, while r1 underperforms and the other models vary in proof style. The benchmark appears slightly easier than the actual Putnam; the results offer insight into LLMs' strengths and limitations in rigorous mathematical reasoning and can inform future evaluation and prompting strategies.

Abstract

In this paper we summarize the results of the Putnam-like benchmark published by Google DeepMind. The dataset consists of 96 original problems in the spirit of the Putnam Competition and 576 solutions generated by LLMs. We analyse the models' performance on this problem set to assess their ability to solve problems from mathematical contests.

Paper Structure

This paper contains 7 sections, 11 figures, and 4 tables.

Figures (11)

  • Figure 1: Total distribution of grades
  • Figure 2: Total distribution of grades in the Putnam competition (best contestants from the last 4 years)
  • Figure 3: Distribution of grades by problem level
  • Figure 4: Distribution of grades by category of the problem
  • Figure 5: Distribution of grades by model
  • ...and 6 more figures