First Proof

Mohammed Abouzaid; Andrew J. Blumberg; Martin Hairer; Joe Kileel; Tamara G. Kolda; Paul D. Nelson; Daniel Spielman; Nikhil Srivastava; Rachel Ward; Shmuel Weinberger; Lauren Williams

TL;DR

The paper addresses the evaluation of AI systems on research-level mathematics by releasing a set of ten questions that arose in real research; the answers are known to the question authors but remain encrypted for a short time. It proposes a 'first proof' evaluation framework focused on proving well-defined statements (proofs of roughly $5$ pages) across multiple domains, while allowing access to external information to emulate real workflows. The authors describe an experimental protocol that uses one-shot prompts on models such as GPT-5.1 and Gemini, with data-contamination mitigation and plans to release the encrypted answers later, but they treat the work as a foundation rather than a formal benchmark. They argue for a community-driven, carefully graded methodology and outline plans for a second question set and broader testing, with the aim of evolving toward a formal benchmark for AI in mathematical research.

Abstract

To assess the ability of current AI systems to correctly answer research-level mathematics questions, we share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time.

Paper Structure (5 sections)

Introduction
The questions
Related work
Implementation details
Discussion
