First Proof

Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, Lauren Williams

TL;DR

The paper addresses the evaluation of AI systems on research-level mathematics by releasing a set of ten questions that arose in real research; the answers are known to the questions' authors and remain encrypted until a future date. It proposes a 'first proof' evaluation framework built around well-defined questions whose proofs run roughly five pages, spanning multiple domains, while allowing access to external information to emulate real workflows. The authors describe an experimental protocol using one-shot prompts on models such as GPT-5.1 and Gemini, including data-contamination mitigation and plans to release the encrypted answers later, but treat the work as a foundation rather than a formal benchmark. They argue for a community-driven, carefully graded methodology and outline plans for a second question set and broader testing, with the aim of evolving toward a formal benchmark for AI in mathematical research.

Abstract

To assess the ability of current AI systems to correctly answer research-level mathematics questions, we share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time.

Paper Structure (5 sections)

This paper contains 5 sections.