Towards Autonomous Mathematics Research

Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong

TL;DR

This work presents Aletheia, a math-research agent that iteratively generates, verifies, and revises proofs end-to-end, powered by Gemini Deep Think, an inference-time scaling framework, and extensive tool use. It documents scaling laws, agentic harnesses, and tool integration that enable progression from Olympiad-style problems to research-level mathematics, while introducing an Autonomous Mathematics Levels taxonomy to quantify autonomy and novelty. Across three milestones—an AI-generated eigenweights paper, AI-guided bounds on multivariate independent sets, and an Erdős problem case study—the paper demonstrates both autonomous results and valuable human–AI collaboration, while candidly discussing limitations and the evaluation gap. It argues for transparent standards to document AI-generated mathematics and outlines a path toward responsible, scalable human–AI collaboration in mathematical research. The work thus illuminates both the potential and the current boundaries of AI-assisted mathematics and lays groundwork for a standardized discourse on autonomy and significance.

Abstract

Recent advances in foundational models have yielded reasoning systems capable of achieving a gold-medal standard at the International Mathematical Olympiad. The transition from competition-level problem-solving to professional research, however, requires navigating vast literature and constructing long-horizon proofs. In this work, we introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end-to-end in natural language. Specifically, Aletheia is powered by an advanced version of Gemini Deep Think for challenging reasoning problems, a novel inference-time scaling law that extends beyond Olympiad-level problems, and intensive tool use to navigate the complexities of mathematical research. We demonstrate the capability of Aletheia from Olympiad problems to PhD-level exercises and, most notably, through several distinct milestones in AI-assisted mathematics research: (a) a research paper (Feng26) generated by AI without any human intervention in calculating certain structure constants in arithmetic geometry called eigenweights; (b) a research paper (LeeSeo26) demonstrating human-AI collaboration in proving bounds on systems of interacting particles called independent sets; and (c) an extensive semi-autonomous evaluation (Feng et al., 2026a) of 700 open problems on Bloom's Erdős Conjectures database, including autonomous solutions to four open questions. In order to help the public better understand the developments pertaining to AI and mathematics, we suggest codifying standard levels quantifying autonomy and novelty of AI-assisted results. We conclude with reflections on human-AI collaboration in mathematics.

Paper Structure (30 sections, 1 equation, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Schematic of the Aletheia agent.
  • Figure 2: The latest advanced version of Deep Think, as of Jan 2026, has significantly outperformed the IMO-Gold version (Jul 2025) on Olympiad-level problems. The inference-time scaling law also transfers to PhD-level exercises. Aletheia makes further leaps in terms of reasoning quality with lower inference-time compute. All results were graded by human experts.
  • Figure 3: A hallucinated paper from the (truncated) output of a model without internet search capability. The red text refers to a completely fabricated paper.
  • Figure 4: When trained for tool use, the model tends not to fabricate papers, but can still cite results incorrectly. In this example, the referenced paper of Galambos exists, but the claimed "classical result" cannot be found there. Prompt and model output truncated.