A Survey in Mathematical Language Processing

Jordan Meadows; Andre Freitas

A Survey in Mathematical Language Processing

Jordan Meadows, Andre Freitas

TL;DR

The paper surveys five representative areas at the intersection of informal mathematical language and machine learning, highlighting how transformer and graph-based methods drive progress and where gaps persist. It analyzes identifier-definition extraction, formula retrieval, informal premise selection, math word problems, and informal theorem proving, focusing on datasets, architectures, and evaluation metrics. Key contributions include identifying recurring methodological themes (structure-aware representations, dual-modality encodings, pretraining) and common limitations (wildcards, benchmarking, variable typing, external knowledge integration). The survey provides a roadmap for designing end-to-end systems that support automated mathematical reasoning and discovery, while outlining concrete data and training needs to advance beyond current capabilities.

Abstract

Informal mathematical text underpins real-world quantitative reasoning and communication. Developing sophisticated methods of retrieval and abstraction from this dual modality is crucial in the pursuit of the vision of automating discovery in quantitative science and mathematics. We track the development of informal mathematical language processing approaches across five strategic sub-areas in recent years, highlighting the prevailing successful methodological elements along with existing limitations.

A Survey in Mathematical Language Processing

TL;DR

Abstract

Paper Structure (9 sections, 6 figures, 1 table)

This paper contains 9 sections, 6 figures, 1 table.

Introduction
Representative Tasks
Identifier-Definition Extraction
Formula Retrieval
Informal Premise Selection
Math Word Problems
Informal Theorem Proving
Common Limitations and Advantages
Conclusion and Future Work

Figures (6)

Figure 1: Extractive tasks such as identifier-definition extraction rely on supporting text for task solutions that require lookup, rather than generating mathematical text not necessarily contained within the supporting text. Solvers are capable of inference beyond the supporting text, whereas theorem proving and automated mathematical discovery type abstractive tasks tend to require abstractive approaches.
Figure 2: taxonomy for approaches related to identifier-definition extraction. "Intra-doc" and "extra-doc" refers to how identifiers and definitions are scoped from supporting text.
Figure 3: taxonomy for approaches related to formula retrieval (math information retrieval). In the "SLT + OPT" (top right) the asterisks in MCAT* and MathBERT* refer to how SLTs and/or OPTs are not encoded directly from trees as seen in any of the Tangent approaches or Approach0. MCAT encodes SLTs implicitly through consideration of MathML Presentation, and OPTs through MathML Content. MathBERT encodes OPT tree information but implicitly encodes SLT informaton through LaTeX formulae, similarly to MCAT. The number in the bottom right of the lower-most boxes is the harmonic mean of partial and full Bpref@1000, as shown in Table \ref{['table:MIR']}.
Figure 4: Formula (a) $y=e^x$ with its Symbol Layout Tree (SLT) (b), and Operator Tree (OPT) (c). SLTs represent formula appearance by the spatial arrangements of math symbols, while OPTs define the mathematical operations represented in expressions.
Figure 5: Example of a conjecture and its premises.
...and 1 more figures

A Survey in Mathematical Language Processing

TL;DR

Abstract

A Survey in Mathematical Language Processing

Authors

TL;DR

Abstract

Table of Contents

Figures (6)