Putnam-like dataset summary: LLMs as mathematical competition contestants
Bartosz Bieganowski, Daniel Strzelecki, Robert Skiba, Mateusz Topolewski
TL;DR
The paper assesses LLMs on a Putnam-like benchmark of 96 problems, with solutions graded by human experts on a $0$–$10$ rubric. The analysis breaks down performance by problem level, category, and model, and includes cross-model comparisons with significance testing. Gemini-2.5-pro-03-25 and Gemini-2.5-flash-04-17 deliver the most complete and best-justified proofs, r1 underperforms, and the remaining models vary in proof style. The benchmark appears slightly easier than the actual Putnam. The results offer insight into the strengths and limitations of LLMs in rigorous mathematical reasoning and can inform future evaluation and prompting strategies.
Abstract
In this paper, we summarize the results of the Putnam-like benchmark published by Google DeepMind. The dataset consists of 96 original problems in the spirit of the Putnam Competition and 576 solutions generated by LLMs. We analyse the performance of the models on this problem set to assess their ability to solve problems from mathematical contests.
