First Proof
Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, Lauren Williams
TL;DR
The paper addresses the evaluation of AI systems on research-level mathematics by releasing a set of ten questions that arose in real research, with the answers kept encrypted until a future date. It proposes a 'first proof' evaluation framework focused on well-defined questions whose proofs run to roughly 5 pages, drawn from multiple domains, while allowing access to external information to emulate real workflows. The authors describe an experimental protocol using one-shot prompts on models such as GPT-5.1 and Gemini, including data-contamination mitigation and plans to release the currently encrypted answers later, but treat the work as a foundation rather than a formal benchmark. They argue for a community-driven, carefully graded methodology and outline plans for a second question set and broader testing, with the aim of evolving toward a formal benchmark for AI in mathematical research.
Abstract
To assess the ability of current AI systems to correctly answer research-level mathematics questions, we share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time.
