Table of Contents
Fetching ...

Learning Interestingness in Automated Mathematical Theory Formation

George Tsoukalas, Rahul Saha, Amitayush Thakur, Sabrina Reguyal, Swarat Chaudhuri

TL;DR

This work tackles open-ended automated mathematics by formulating theory formation as a reinforcement learning problem and introducing Fermat, an environment where an agent builds definitions, conjectures, and proofs within a knowledge-graph

Abstract

We take two key steps in automating the open-ended discovery of new mathematical theories, a grand challenge in artificial intelligence. First, we introduce $\emph{FERMAT}$, a reinforcement learning (RL) environment that models concept discovery and theorem-proving using a set of symbolic actions, opening up a range of RL problems relevant to theory discovery. Second, we explore a specific problem through $\emph{FERMAT}$: automatically scoring the $\emph{interestingness}$ of mathematical objects. We investigate evolutionary algorithms for synthesizing nontrivial interestingness measures. In particular, we introduce an LLM-based evolutionary algorithm that features function abstraction, leading to notable improvements in discovering elementary number theory and finite fields over hard-coded baselines. We open-source the $\emph{FERMAT}$ environment at this URL(https://github.com/trishullab/Fermat).

Learning Interestingness in Automated Mathematical Theory Formation

TL;DR

This work tackles open-ended automated mathematics by formulating theory formation as a reinforcement learning problem and introducing Fermat, an environment where an agent builds definitions, conjectures, and proofs within a knowledge-graph

Abstract

We take two key steps in automating the open-ended discovery of new mathematical theories, a grand challenge in artificial intelligence. First, we introduce , a reinforcement learning (RL) environment that models concept discovery and theorem-proving using a set of symbolic actions, opening up a range of RL problems relevant to theory discovery. Second, we explore a specific problem through : automatically scoring the of mathematical objects. We investigate evolutionary algorithms for synthesizing nontrivial interestingness measures. In particular, we introduce an LLM-based evolutionary algorithm that features function abstraction, leading to notable improvements in discovering elementary number theory and finite fields over hard-coded baselines. We open-source the environment at this URL(https://github.com/trishullab/Fermat).

Paper Structure

This paper contains 40 sections, 7 equations, 21 figures, 2 algorithms.

Figures (21)

  • Figure 1: A high-level description of Fermat, our environment for mathematical theory formation. At any given time, the current theory (state) is represented as a knowledge graph consisting of the mathematical definitions, conjectures, and theorems discovered so far. At each step, the policy $\pi$ inputs the current state and selects an action to apply, updating the theory with additional information. The action space allows the production of new definitions, conjectures, and proofs of theorems.
  • Figure 2: Overview of EvoAbstract, which aims to discover an optimal interestingness measure for mathematical theory formation. It operates through three phases: (1) Evolution, where populations of candidate programs are generated and refined through LLM-driven mutations; (2) Abstraction, where high-performing programs are analyzed and reusable subroutines are extracted; and (3) Policy Evaluation, where the resulting programs are evaluated within the theory formation environment using Fermat, producing feedback that guides subsequent evolutionary steps.
  • Figure 3: A plot of the best program found per iteration for FunSearch and EvoAbstract, shown for the three different starting knowledge graphs, averaged over four runs. As can be seen, the EvoAbstract and FunSearch methods dominate performance universally across all baselines. On arithmetic_base, EvoAbstract slightly outperforms FunSearch, while on succ_zero_eq and ff_27 EvoAbstract optimizes the interestingness measures early on, but its performance plateaus sooner than FunSearch, which continues to improve.
  • Figure 4: Performance comparison of EvoAbstract against various baseline measures on three starting theories: succ_zero_eq, arithmetic_base, and ff_27. Each baseline receives 64 theory-formation evaluations. For FunSearch and EvoAbstract, we include the average score (standard deviation) of the best found program over four independent runs.
  • Figure 5: DSL snippets
  • ...and 16 more figures