A Survey in Mathematical Language Processing
Jordan Meadows, Andre Freitas
TL;DR
The paper surveys five representative areas at the intersection of informal mathematical language and machine learning, highlighting how transformer and graph-based methods drive progress and where gaps persist. It analyzes identifier-definition extraction, formula retrieval, informal premise selection, math word problems, and informal theorem proving, focusing on datasets, architectures, and evaluation metrics. Key contributions include identifying recurring methodological themes (structure-aware representations, dual-modality encodings, pretraining) and common limitations (wildcards, benchmarking, variable typing, external knowledge integration). The survey provides a roadmap for designing end-to-end systems that support automated mathematical reasoning and discovery, while outlining concrete data and training needs to advance beyond current capabilities.
Abstract
Informal mathematical text underpins real-world quantitative reasoning and communication. Developing sophisticated methods of retrieval and abstraction from this dual modality is crucial in the pursuit of the vision of automating discovery in quantitative science and mathematics. We track the development of informal mathematical language processing approaches across five strategic sub-areas in recent years, highlighting the prevailing successful methodological elements along with existing limitations.
