TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition
Yupei Li, Philipp Borchert, Gerasimos Lampouras
TL;DR
TopoAlign addresses the data scarcity bottleneck in mathematical autoformalisation by structurally aligning widely available code with Lean 4 statements. It decomposes code into docstrings, main functions, and dependency functions to mirror formal mathematical structure, enabling Code Autoformalisation (CAF) training that blends code and math data. On benchmarks MiniF2F, Putnam, and ProofNet, TopoAlign yields substantial gains for DeepSeek-Math and moderate gains for Herald, demonstrating effective transfer of code-based problem-solving and structural knowledge to formal reasoning. The work introduces a large, 324.5M-token corpus of aligned code and formal data and shows that a balanced mix of code and math data optimizes autoformalisation performance.
Abstract
Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.
