Mathematical Derivation Graphs: A Relation Extraction Task in STEM Manuscripts
Vishesh Prasad, Brian Kim, Nickvash Kani
TL;DR
This work defines a new relation extraction task focused on inter-equation dependencies in STEM manuscripts by introducing the Mathematical Derivation Graphs Dataset (MDGD) derived from 107 arXiv papers. It formalizes derivation graphs as directed acyclic graphs where nodes are key equations and edges encode derivational dependencies, and evaluates a spectrum of analytical, ML, and LLM-based methods to extract these edges. The study finds that baseline methods and zero-shot LLMs achieve $F_1$ around $0.45$–$0.52$, with targeted fixes offering incremental improvements but no decisive gains, highlighting the challenge of mathematical relation extraction. The results motivate hybrid analytic-LLM pipelines and task-specific modeling as promising directions for improving machine understanding and reconstruction of mathematical derivations in scholarly texts.
Abstract
Recent advances in natural language processing (NLP), particularly the emergence of large language models (LLMs), have significantly enhanced textual analysis. However, while these developments have yielded substantial progress on natural language text, applying the same analysis to mathematical equations and the relationships between them has produced mixed results. This paper takes initial steps toward extending relation extraction to the dependency relationships between mathematical expressions in STEM articles. The authors construct the Mathematical Derivation Graphs Dataset (MDGD), sourced from a random sampling of the arXiv corpus, which covers $107$ published STEM manuscripts and contains over $2000$ manually labeled inter-equation dependency relationships. Labeling each manuscript yields a new object, referred to as a derivation graph, that summarizes the mathematical content of the manuscript. The authors exhaustively evaluate analytical and machine learning (ML) based models on their ability to identify and extract the derivation relationships for each article, comparing the results with the ground truth. The best tested LLMs achieve $F_1$ scores of $\sim 45\%$–$52\%$, and the authors attempt to improve this performance by combining the LLMs with analytic algorithms and other methods.
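To make the derivation-graph idea concrete, the following is a minimal sketch, not the authors' implementation. It stores a derivation graph as a set of directed edges (source equation, derived equation), checks that the graph is acyclic, and scores a predicted edge set against a labeled one with precision, recall, and $F_1$. The equation labels, the toy edges, the function names `is_acyclic` and `edge_f1`, and the choice to score at the level of individual edges are all assumptions made for illustration; the paper may aggregate its scores differently.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+


def is_acyclic(edges):
    """Return True if the directed edges (source, target) form a DAG."""
    predecessors = {}
    for source, target in edges:
        predecessors.setdefault(source, set())
        predecessors.setdefault(target, set()).add(source)
    try:
        # static_order() raises CycleError if no topological order exists.
        tuple(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return False
    return True


def edge_f1(true_edges, predicted_edges):
    """Edge-level precision, recall, and F1 (assumed scoring scheme)."""
    true_edges, predicted_edges = set(true_edges), set(predicted_edges)
    true_positives = len(true_edges & predicted_edges)
    precision = true_positives / len(predicted_edges) if predicted_edges else 0.0
    recall = true_positives / len(true_edges) if true_edges else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical manuscript: Eq. 1 leads to Eqs. 2 and 3, which both lead to Eq. 4.
gold_edges = {("1", "2"), ("1", "3"), ("2", "4"), ("3", "4")}
# Hypothetical model output: two correct edges and one spurious edge.
predicted_edges = {("1", "2"), ("2", "4"), ("1", "4")}

assert is_acyclic(gold_edges)
precision, recall, f1 = edge_f1(gold_edges, predicted_edges)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
# prints "precision=0.667 recall=0.500 F1=0.571"
```

In this toy case the model recovers two of the four labeled dependencies and adds one edge that is not in the labeled graph, giving an $F_1$ of about $0.57$. This shows how a partly correct graph can still land in the middle of the score range.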
