Table of Contents
Fetching ...

TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition

Yupei Li, Philipp Borchert, Gerasimos Lampouras

TL;DR

TopoAlign addresses the data scarcity bottleneck in mathematical autoformalisation by structurally aligning widely available code with Lean 4 statements. It decomposes code into docstrings, main functions, and dependency functions to mirror formal mathematical structure, enabling Code Autoformalisation (CAF) training that blends code and math data. On benchmarks MiniF2F, Putnam, and ProofNet, TopoAlign yields substantial gains for DeepSeek-Math and moderate gains for Herald, demonstrating effective transfer of code-based problem-solving and structural knowledge to formal reasoning. The work introduces a large, 324.5M-token corpus of aligned code and formal data and shows that a balanced mix of code and math data optimizes autoformalisation performance.

Abstract

Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.

TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition

TL;DR

TopoAlign addresses the data scarcity bottleneck in mathematical autoformalisation by structurally aligning widely available code with Lean 4 statements. It decomposes code into docstrings, main functions, and dependency functions to mirror formal mathematical structure, enabling Code Autoformalisation (CAF) training that blends code and math data. On benchmarks MiniF2F, Putnam, and ProofNet, TopoAlign yields substantial gains for DeepSeek-Math and moderate gains for Herald, demonstrating effective transfer of code-based problem-solving and structural knowledge to formal reasoning. The work introduces a large, 324.5M-token corpus of aligned code and formal data and shows that a balanced mix of code and math data optimizes autoformalisation performance.

Abstract

Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.

Paper Structure

This paper contains 16 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Structural similarity between code (left) and formal statements in Lean 4 (right). Code samples extracted from GitHub repositories are decomposed into: the docstring, which maps to informal statements in mathematical problems, the main function, which corresponds to the formal statements, and the dependency functions, which correspond to supporting lemmata and theorems, included in external libraries (e.g. Mathlib for Lean 4).
  • Figure 2: Function-level dependency tree showing the hierarchy of function calls, starting from the root node. Each child node represents a function called by its parent. The docstring for the root node is extracted to represent the description of the problem addressed in the code.
  • Figure 3: Distribution of maximum dependency tree depths across a random sample of 200 repositories.
  • Figure 4: Distribution of the maximum number of siblings in dependency trees across a random sample of 200 repositories.
  • Figure 5: Overview of the training pipeline. The model takes a problem description (for either code or math) and its dependencies as input. The training objective is to generate the corresponding solution: the root code block for code inputs, or the formal statement for math inputs.