Machine Learning meets Algebraic Combinatorics: A Suite of Datasets Capturing Research-level Conjecturing Ability in Pure Mathematics

Herman Chau, Helen Jenne, Davis Brown, Jesse He, Mark Raugas, Sara Billey, Henry Kvinge

TL;DR

The paper presents the Algebraic Combinatorics Dataset Repository (ACD Repo), a collection of nine datasets designed to enable ML-driven conjecture generation in pure mathematics, particularly algebraic combinatorics. It provides diverse, open-ended problems with large example pools and outlines concrete ML tasks (classification or regression) across objects like partitions, tableaux, permutations, quivers, and lattice paths. Through baseline experiments and case studies, the authors demonstrate how narrow models, transformers, and even foundation-model program synthesis can extract patterns that may guide conjecturing, while highlighting challenges such as data imbalance and representation sensitivity. The work aims to accelerate mathematical discovery by offering a public, task-oriented resource that researchers can reuse to explore conjecturing strategies, interpretability, and algorithmic insights in a mathematically rich setting.

Abstract

With recent dramatic increases in AI system capabilities, there has been growing interest in utilizing machine learning for reasoning-heavy, quantitative tasks, particularly mathematics. While there are many resources capturing mathematics at the high-school, undergraduate, and graduate levels, there are far fewer resources available that align with the level of difficulty and open-endedness encountered by professional mathematicians working on open problems. To address this, we introduce a new collection of datasets, the Algebraic Combinatorics Dataset Repository (ACD Repo), representing either foundational results or open problems in algebraic combinatorics, a subfield of mathematics that studies discrete structures arising from abstract algebra. Further differentiating our dataset collection is the fact that it aims at the conjecturing process. Each dataset includes an open-ended research-level question and a large collection of examples (up to 10M in some cases) from which conjectures should be generated. We describe all nine datasets, the different ways machine learning models can be applied to them (e.g., training with narrow models followed by interpretability analysis or program synthesis with LLMs), and discuss some of the challenges involved in designing datasets like these.

Paper Structure

This paper contains 29 sections, 4 equations, 9 figures, and 32 tables.

Figures (9)

  • Figure 1: (Left) A Young diagram for the partition $(3,2,2)$. (Center) A standard Young tableau for the partition $(3,2,2)$. (Right) A semistandard Young tableau for the partition $(3,2,2)$.
  • Figure 2: (Left) Performance on the Lattice Path Dataset as a function of the lattice path endpoint (a larger endpoint means longer and more numerous paths). As $n$ grows in $n \times (n-1)$, the training set size increases but the problem may also grow harder. (Center) Performance on the type $E$ versus type $D$ quiver classification task as a function of the depth, which must be specified for type $E$ quivers on $n = 10, 11, 12$ vertices. (Right) Performance on the same task as a function of the number of vertices $n$.
  • Figure 3: Histogram of $S_{18}$ characters within the interval $[-500,500]$
  • Figure 4: Histogram of $S_{20}$ characters within the interval $[-500,500]$
  • Figure 5: Histogram of $S_{22}$ characters within the interval $[-500,500]$
  • ...and 4 more figures
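
The objects in Figure 1 can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the ACD Repo: the function names and the two example fillings of the partition $(3,2,2)$ are assumptions chosen for this example, not the fillings drawn in the figure. It uses the usual conventions: a semistandard Young tableau has rows that weakly increase left to right and columns that strictly increase top to bottom, and a standard Young tableau additionally uses each of $1, \dots, n$ exactly once.

```python
from typing import List, Sequence

Tableau = List[List[int]]


def young_diagram(shape: Sequence[int]) -> str:
    """Render a partition as rows of '#' characters, one row per part."""
    return "\n".join("#" * part for part in shape)


def has_shape(tableau: Tableau, shape: Sequence[int]) -> bool:
    """Check that the row lengths of the filling match the partition."""
    return [len(row) for row in tableau] == list(shape)


def is_semistandard(tableau: Tableau) -> bool:
    """Rows weakly increase; columns strictly increase."""
    for row in tableau:
        if any(a > b for a, b in zip(row, row[1:])):
            return False
    for upper, lower in zip(tableau, tableau[1:]):
        # zip stops at the shorter (lower) row, so only real columns are compared.
        if any(u >= v for u, v in zip(upper, lower)):
            return False
    return True


def is_standard(tableau: Tableau) -> bool:
    """Semistandard, and the entries are exactly 1..n with no repeats."""
    entries = sorted(x for row in tableau for x in row)
    return entries == list(range(1, len(entries) + 1)) and is_semistandard(tableau)


# Hypothetical example fillings of shape (3, 2, 2), chosen for illustration.
shape = (3, 2, 2)
standard_example = [[1, 2, 5], [3, 4], [6, 7]]
semistandard_example = [[1, 1, 2], [2, 3], [4, 4]]

print(young_diagram(shape))
print(has_shape(standard_example, shape), is_standard(standard_example))
print(has_shape(semistandard_example, shape), is_semistandard(semistandard_example))
```

Every standard Young tableau is also semistandard, but not conversely: the second example repeats entries, so it passes `is_semistandard` and fails `is_standard`.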