What makes math problems hard for reinforcement learning: a case study

Ali Shehper; Anibal M. Medina-Mardones; Lucas Fagan; Bartłomiej Lewandowski; Angus Gruen; Yang Qiu; Piotr Kucharski; Zhenghan Wang; Sergei Gukov

TL;DR

This work investigates why certain math problems are exceptionally hard for reinforcement learning, using the Andrews–Curtis conjecture as a case study. It combines three approaches to study how hardness is distributed across balanced presentations, notably in the Miller–Schupp and Akbulut–Kirby series: classical search (breadth-first and greedy), reinforcement learning (PPO with varied horizon lengths), and language modeling (decoder-only transformers). A central contribution is a global hardness measure based on persistent homology, together with analyses of local graph features that predict solvability. The authors report both algorithmic advances (supermoves, adaptive action spaces) and new mathematical results (length reductions for $\text{AK}(n)$ and AC-trivializations in subfamilies of the Miller–Schupp series). They also connect stability concepts to knot theory, showing that stably AC-trivial presentations arise naturally from unknot diagrams and Wirtinger presentations, and they note misprints and caveats in the related literature. Overall, the paper offers a blueprint for tackling hard mathematical search problems with learning methods and provides concrete results that connect the mathematics with modern AI methodology.

Abstract

Using a long-standing conjecture from combinatorial group theory, we explore, from multiple perspectives, the challenges of finding rare instances carrying disproportionately high rewards. Based on lessons learned in the context defined by the Andrews-Curtis conjecture, we propose algorithmic enhancements and a topological hardness measure with implications for a broad class of search problems. As part of our study, we also address several open mathematical questions. Notably, we demonstrate the length reducibility of all but two presentations in the Akbulut-Kirby series (1981), and resolve various potential counterexamples in the Miller-Schupp series (1991), including three infinite subfamilies.

Paper Structure (47 sections, 16 theorems, 71 equations, 25 figures, 1 table, 6 algorithms)

Introduction
Andrews–Curtis conjecture
Classical search algorithms
Breadth-first search
Greedy search
Comparison of performance on Miller–Schupp series
AC-triviality of $\text{MS}(1, w)$
Length reduction for $\text{AK}(n)$
Limitations and extensions
Reinforcement learning
Markov decision process
Proximal policy optimization
Application to the MS series
PPO with constant horizon length: experimental results
New Miller–Schupp trivializations through variable horizon length
...and 32 more sections

Key Result

Theorem A

For every $n \geq 2$, $\text{AK}(n)$ is AC-equivalent to a presentation of length $n+11$. This gives a reduction in the length of $\text{AK}(n)$ for all $n \geq 5$.

Figures (25)

Figure 1: Comparison of greedy and breadth-first search algorithms as a function of $n$. The number of presentations of the Miller–Schupp series, $\text{MS}(n, w)$, solved by an algorithm is given on the vertical axis.
Figure 2: The maximum increase in the length of a presentation relative to its initial length along the AC-trivialization path. The increase is plotted as a function of the initial length of the presentation on the left and as a function of $n$ on the right.
Figure 3: Distribution of lengths of AC-trivialization paths learned by greedy search as a function of maximum increase in presentation length (left) and $n$ (right).
Figure 4: The basic RL cycle.
Figure 5: A comparison of the three algorithms that we used to search through the space of balanced presentations: breadth-first search, greedy search, and Proximal Policy Optimization (PPO) with a small, constant horizon length during training. The number of presentations of the Miller–Schupp series, $\text{MS}(n, w)$, solved by an algorithm is given on the vertical axis. We compare the performance as a function of $n$ (above) and the length of the presentation (below). Note that varying the horizon length during training (not depicted in this figure) helped PPO solve presentations that greedy search could not solve.
...and 20 more figures

Theorems & Definitions (31)

Theorem A
Theorem B
Definition 1: Substitution
Theorem 2
proof
Theorem 3
proof
Theorem 4
Proposition 5: Myasnikov, Myasnikov, and Shpilrain
Theorem 6
...and 21 more
