Compression is all you need: Modeling Mathematics

Vitaly Aksenov; Eve Bodnia; Michael H. Freedman; Michael Mulligan

Compression is all you need: Modeling Mathematics

Vitaly Aksenov, Eve Bodnia, Michael H. Freedman, Michael Mulligan

Abstract

Human mathematics (HM), the mathematics humans discover and value, is a vanishingly small subset of formal mathematics (FM), the totality of all valid deductions. We argue that HM is distinguished by its compressibility through hierarchically nested definitions, lemmas, and theorems. We model this with monoids. A mathematical deduction is a string of primitive symbols; a definition or theorem is a named substring or macro whose use compresses the string. In the free abelian monoid $A_n$, a logarithmically sparse macro set achieves exponential expansion of expressivity. In the free non-abelian monoid $F_n$, even a polynomially-dense macro set only yields linear expansion; superlinear expansion requires near-maximal density. We test these models against MathLib, a large Lean~4 library of mathematics that we take as a proxy for HM. Each element has a depth (layers of definitional nesting), a wrapped length (tokens in its definition), and an unwrapped length (primitive symbols after fully expanding all references). We find unwrapped length grows exponentially with both depth and wrapped length; wrapped length is approximately constant across all depths. These results are consistent with $A_n$ and inconsistent with $F_n$, supporting the thesis that HM occupies a polynomially-growing subset of the exponentially growing space FM. We discuss how compression, measured on the MathLib dependency graph, and a PageRank-style analysis of that graph can quantify mathematical interest and help direct automated reasoning toward the compressible regions where human mathematics lives.

Compression is all you need: Modeling Mathematics

Abstract

, a logarithmically sparse macro set achieves exponential expansion of expressivity. In the free non-abelian monoid

, even a polynomially-dense macro set only yields linear expansion; superlinear expansion requires near-maximal density. We test these models against MathLib, a large Lean~4 library of mathematics that we take as a proxy for HM. Each element has a depth (layers of definitional nesting), a wrapped length (tokens in its definition), and an unwrapped length (primitive symbols after fully expanding all references). We find unwrapped length grows exponentially with both depth and wrapped length; wrapped length is approximately constant across all depths. These results are consistent with

and inconsistent with

, supporting the thesis that HM occupies a polynomially-growing subset of the exponentially growing space FM. We discuss how compression, measured on the MathLib dependency graph, and a PageRank-style analysis of that graph can quantify mathematical interest and help direct automated reasoning toward the compressible regions where human mathematics lives.

Paper Structure (29 sections, 7 theorems, 62 equations, 5 figures, 3 tables)

This paper contains 29 sections, 7 theorems, 62 equations, 5 figures, 3 tables.

Introduction
Monoid Models
Free Abelian Monoid
Free Monoid
Interpreting Data from MathLib
Constructing the dependency graph
Wrapped and unwrapped lengths, and depth
Distributions of wrapped and unwrapped lengths, and depth
Unwrapped length versus wrapped length
Wrapped length versus depth
Unwrapped length versus depth
Discriminating between regimes
$A_n$, logarithmic-density macros (Theorem \ref{['thm:abelian-place']})
$A_n$, polynomial-density macros / Waring (Theorem \ref{['thm:abelian-waring']})
$A_n$, double-logarithmic density (Theorem \ref{['thm:abelian-double-log']})
...and 14 more sections

Key Result

Theorem 1

For $A_n$ and any integer $b \geq 2$, the macro set $M = \{ b^j a_i : i = 1, \ldots, n, \ j \geq 1 \}$ has logarithmic density and satisfies for all integers $s \geq 1$. In particular, $f_{G'}(s) = \Theta(b^{s/(n(b-1))})$.

Figures (5)

Figure 1: A FM DH fragment (left) and a corresponding MathLib DAG (right) for deriving $A \wedge B \wedge C$ via $\wedge$-introduction. In the DH, filled dots represent hyperedges: each groups the premises used in a single inference step. Both proofs (via $A \wedge B$ and via $B \wedge C$) are recorded. The DAG selects one proof and replaces the hyperedge with ordinary edges. If both proofs in the DH were recorded within a DAG, there would be an ambiguity in how a given conclusion followed from the premises. Note also that the above example is special in the sense that the hyperedges are reversible.
Figure 2: Distributions of MathLib elements by (a) $\log_2(\text{unwrapped length})$, (b) wrapped length, and (c) depth.
Figure 3: Median $\log_2(\text{unwrapped length})$ versus wrapped length. Each point represents the median over all elements at that wrapped length.
Figure 4: Median wrapped length versus depth.
Figure 5: Median $\log_2(\text{unwrapped length})$ versus depth. The approximately linear relationship indicates exponential growth of unwrapped length with depth.

Theorems & Definitions (20)

Theorem 1: Place notation gives exponential expansion https://github.com/Aksenov239/lean-fun/blob/toy-math-model/LeanFun/theorem1.lean#L1531
proof
Theorem 2: Polylogarithmic density gives quasi-exponential expansion https://github.com/Aksenov239/lean-fun/blob/toy-math-model/LeanFun/theorem2.lean#L1033
proof
Theorem 3: Polynomial density gives infinite expansion https://github.com/Aksenov239/lean-fun/blob/toy-math-model/LeanFun/theorem3.lean#L381
proof
Remark 1
Remark 2
Theorem 4: Polynomial density gives linear expansion https://github.com/Aksenov239/lean-fun/blob/toy-math-model/LeanFun/theorem4.lean#L1565
proof
...and 10 more

Compression is all you need: Modeling Mathematics

Abstract

Compression is all you need: Modeling Mathematics

Authors

Abstract

Table of Contents

Key Result

Figures (5)

Theorems & Definitions (20)