Table of Contents
Fetching ...

In between myth and reality: AI for math -- a case study in category theory

Răzvan Diaconescu

TL;DR

This paper investigates how contemporary AI systems perform on a math research task by conducting a case study in category theory, focusing on inclusion systems and a nontrivial pullback exercise. It contrasts o3-mini and Grok-3, highlighting how retrieval of definitions, mathematical language, and reasoning quality affect results; overall, AI yielded partial validity for one part of the proof but failed to produce a correct, complete solution due to definitional errors and notation issues. The authors conclude that current AI cannot autonomously perform mathematical understanding and rigorous argumentation at the research level, urging reliance on human literature and complementary formal tools. Practically, they propose architectural safeguards for AI systems, and advocate combining soft AI assistance with hard automated provers such as Vampire, Spass, or E-2.3-2019 to verifiably check results.

Abstract

Recently, there is an increasing interest in understanding the performance of AI systems in solving math problems. A multitude of tests have been performed, with mixed conclusions. In this paper we discuss an experiment we have made in the direction of mathematical research, with two of the most prominent contemporary AI systems. One of the objective of this experiment is to get an understanding of how AI systems can assist mathematical research. Another objective is to support the AI systems developers by formulating suggestions for directions of improvement.

In between myth and reality: AI for math -- a case study in category theory

TL;DR

This paper investigates how contemporary AI systems perform on a math research task by conducting a case study in category theory, focusing on inclusion systems and a nontrivial pullback exercise. It contrasts o3-mini and Grok-3, highlighting how retrieval of definitions, mathematical language, and reasoning quality affect results; overall, AI yielded partial validity for one part of the proof but failed to produce a correct, complete solution due to definitional errors and notation issues. The authors conclude that current AI cannot autonomously perform mathematical understanding and rigorous argumentation at the research level, urging reliance on human literature and complementary formal tools. Practically, they propose architectural safeguards for AI systems, and advocate combining soft AI assistance with hard automated provers such as Vampire, Spass, or E-2.3-2019 to verifiably check results.

Abstract

Recently, there is an increasing interest in understanding the performance of AI systems in solving math problems. A multitude of tests have been performed, with mixed conclusions. In this paper we discuss an experiment we have made in the direction of mathematical research, with two of the most prominent contemporary AI systems. One of the objective of this experiment is to get an understanding of how AI systems can assist mathematical research. Another objective is to support the AI systems developers by formulating suggestions for directions of improvement.

Paper Structure

This paper contains 24 sections, 13 equations.

Theorems & Definitions (1)

  • Definition 2.1