Thinking Machines: Mathematical Reasoning in the Age of LLMs

Andrea Asperti; Alberto Naibo; Claudio Sacerdoti Coen

Thinking Machines: Mathematical Reasoning in the Age of LLMs

Andrea Asperti, Alberto Naibo, Claudio Sacerdoti Coen

TL;DR

The paper surveys the state of mathematical reasoning with large language models, contrasting traditional natural-language mathematics with fully formalized proofs and highlighting why proving is harder than coding. It reviews training pipelines (pretraining, SFT, reward models, RL) and inference-time strategies, and assesses a broad suite of benchmarks (AIME 2024, PGPS9K, miniF2F, FrontierMath) to illustrate current capabilities and limits. A core thread analyzes autoformalization as a bridge between domains, revealing both its promise and brittleness, and discusses state representation and feedback in formal proof settings. The authors advocate hybrid, tool-augmented approaches and call for evaluation methods that probe persistent state and long-horizon reasoning, suggesting that a unified framework for traditional and formal mathematics is achievable with advances in interfaces, training signals, and benchmarks.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in structured reasoning and symbolic tasks, with coding emerging as a particularly successful application. This progress has naturally motivated efforts to extend these models to mathematics, both in its traditional form, expressed through natural-style mathematical language, and in its formalized counterpart, expressed in a symbolic syntax suitable for automatic verification. Yet, despite apparent parallels between programming and proof construction, advances in formalized mathematics have proven significantly more challenging. This gap raises fundamental questions about the nature of reasoning in current LLM architectures, the role of supervision and feedback, and the extent to which such models maintain an internal notion of computational or deductive state. In this article, we review the current state-of-the-art in mathematical reasoning with LLMs, focusing on recent models and benchmarks. We explore three central issues at the intersection of machine learning and mathematical cognition: (i) the trade-offs between traditional and formalized mathematics as training and evaluation domains; (ii) the structural and methodological reasons why proof synthesis remains more brittle than code generation; and (iii) whether LLMs genuinely represent or merely emulate a notion of evolving logical state. Our goal is not to draw rigid distinctions but to clarify the present boundaries of these systems and outline promising directions for their extension.

Thinking Machines: Mathematical Reasoning in the Age of LLMs

TL;DR

Abstract

Thinking Machines: Mathematical Reasoning in the Age of LLMs

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (2)