Table of Contents
Fetching ...

Thinking Machines: Mathematical Reasoning in the Age of LLMs

Andrea Asperti, Alberto Naibo, Claudio Sacerdoti Coen

TL;DR

The paper surveys the state of mathematical reasoning with large language models, contrasting traditional natural-language mathematics with fully formalized proofs and highlighting why proving is harder than coding. It reviews training pipelines (pretraining, SFT, reward models, RL) and inference-time strategies, and assesses a broad suite of benchmarks (AIME 2024, PGPS9K, miniF2F, FrontierMath) to illustrate current capabilities and limits. A core thread analyzes autoformalization as a bridge between domains, revealing both its promise and brittleness, and discusses state representation and feedback in formal proof settings. The authors advocate hybrid, tool-augmented approaches and call for evaluation methods that probe persistent state and long-horizon reasoning, suggesting that a unified framework for traditional and formal mathematics is achievable with advances in interfaces, training signals, and benchmarks.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in structured reasoning and symbolic tasks, with coding emerging as a particularly successful application. This progress has naturally motivated efforts to extend these models to mathematics, both in its traditional form, expressed through natural-style mathematical language, and in its formalized counterpart, expressed in a symbolic syntax suitable for automatic verification. Yet, despite apparent parallels between programming and proof construction, advances in formalized mathematics have proven significantly more challenging. This gap raises fundamental questions about the nature of reasoning in current LLM architectures, the role of supervision and feedback, and the extent to which such models maintain an internal notion of computational or deductive state. In this article, we review the current state-of-the-art in mathematical reasoning with LLMs, focusing on recent models and benchmarks. We explore three central issues at the intersection of machine learning and mathematical cognition: (i) the trade-offs between traditional and formalized mathematics as training and evaluation domains; (ii) the structural and methodological reasons why proof synthesis remains more brittle than code generation; and (iii) whether LLMs genuinely represent or merely emulate a notion of evolving logical state. Our goal is not to draw rigid distinctions but to clarify the present boundaries of these systems and outline promising directions for their extension.

Thinking Machines: Mathematical Reasoning in the Age of LLMs

TL;DR

The paper surveys the state of mathematical reasoning with large language models, contrasting traditional natural-language mathematics with fully formalized proofs and highlighting why proving is harder than coding. It reviews training pipelines (pretraining, SFT, reward models, RL) and inference-time strategies, and assesses a broad suite of benchmarks (AIME 2024, PGPS9K, miniF2F, FrontierMath) to illustrate current capabilities and limits. A core thread analyzes autoformalization as a bridge between domains, revealing both its promise and brittleness, and discusses state representation and feedback in formal proof settings. The authors advocate hybrid, tool-augmented approaches and call for evaluation methods that probe persistent state and long-horizon reasoning, suggesting that a unified framework for traditional and formal mathematics is achievable with advances in interfaces, training signals, and benchmarks.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in structured reasoning and symbolic tasks, with coding emerging as a particularly successful application. This progress has naturally motivated efforts to extend these models to mathematics, both in its traditional form, expressed through natural-style mathematical language, and in its formalized counterpart, expressed in a symbolic syntax suitable for automatic verification. Yet, despite apparent parallels between programming and proof construction, advances in formalized mathematics have proven significantly more challenging. This gap raises fundamental questions about the nature of reasoning in current LLM architectures, the role of supervision and feedback, and the extent to which such models maintain an internal notion of computational or deductive state. In this article, we review the current state-of-the-art in mathematical reasoning with LLMs, focusing on recent models and benchmarks. We explore three central issues at the intersection of machine learning and mathematical cognition: (i) the trade-offs between traditional and formalized mathematics as training and evaluation domains; (ii) the structural and methodological reasons why proof synthesis remains more brittle than code generation; and (iii) whether LLMs genuinely represent or merely emulate a notion of evolving logical state. Our goal is not to draw rigid distinctions but to clarify the present boundaries of these systems and outline promising directions for their extension.

Paper Structure

This paper contains 40 sections, 2 figures, 6 tables.

Figures (2)

  • Figure S1: Typical interaction loop between a formal prover and a Large Language Model is the case of formal mathematics. The LLM proposes a proof step (e.g., a tactic), which is checked by an external proof assistant that explicitly maintains and updates the proof state. Unlike standard reinforcement learning loops with an evolving external environment, state transitions here are symbolic and tool-mediated, and the model does not internally persist the deductive state but receives it anew at each iteration. Feedback or hints are generated when necessary, and the loop continues until a complete proof is assembled.
  • Figure S2: A sequence of progressive changes proposed by AlphaEvolve to discover faster matrix multiplication algorithms. The smooth search space facilitates incremental improvement.