Table of Contents
Fetching ...

Herald: A Natural Language Annotated Lean 4 Dataset

Guoxiong Gao, Yutong Wang, Jiedong Jiang, Qi Gao, Zihan Qin, Tianyi Xu, Bin Dong

TL;DR

Herald addresses the shortage of natural language–to–Lean formal language pairs by introducing a hierarchical, structure-aware NL–FL translation pipeline and an open, large-scale Herald dataset derived from Mathlib4. The approach highlights rich contextual augmentation, dependency-based translation order, and two augmentation strategies, enabling state-of-the-art NL–FL translation performance (e.g., 96.7% on MiniF2F-test and 23.5% on a graduate-level dataset) and practical autoformalization demonstrations on real-world text like the Stacks Project. Validation combines Lean compiler checks, back-translation, and NLI, with a focus on project-wide formalization potential rather than single-theorem translation. The work provides a scalable, extensible framework and openly releases the Herald dataset and translator to accelerate progress in mathematical formalization and LLM-based theorem proving.

Abstract

Verifiable formal languages like Lean have profoundly impacted mathematical reasoning, particularly through the use of large language models (LLMs) for automated reasoning. A significant challenge in training LLMs for these formal languages is the lack of parallel datasets that align natural language with formal language proofs. To address this challenge, this paper introduces a novel framework for translating the Mathlib4 corpus (a unified library of mathematics in formal language Lean 4) into natural language. Building upon this, we employ a dual augmentation strategy that combines tactic-based and informal-based approaches, leveraging the Lean-jixia system, a Lean 4 analyzer. We present the results of this pipeline on Mathlib4 as Herald (Hierarchy and Retrieval-based Translated Lean Dataset). We also propose the Herald Translator, which is fine-tuned on Herald. Herald translator achieves a 93.2% accuracy (Pass@128) on formalizing statements in the miniF2F-test and a 22.5% accuracy on our internal graduate-level textbook dataset, outperforming InternLM2-Math-Plus-7B (74.0% and 7.5%) and TheoremLlama (50.1% and 4.0%). Furthermore, we propose a section-level translation framework for real-world applications. As a direct application of Herald translator, we have successfully translated a template section in the Stack project, marking a notable progress in the automatic formalization of graduate-level mathematical literature. Our model, along with the datasets, are open-sourced to the public.

Herald: A Natural Language Annotated Lean 4 Dataset

TL;DR

Herald addresses the shortage of natural language–to–Lean formal language pairs by introducing a hierarchical, structure-aware NL–FL translation pipeline and an open, large-scale Herald dataset derived from Mathlib4. The approach highlights rich contextual augmentation, dependency-based translation order, and two augmentation strategies, enabling state-of-the-art NL–FL translation performance (e.g., 96.7% on MiniF2F-test and 23.5% on a graduate-level dataset) and practical autoformalization demonstrations on real-world text like the Stacks Project. Validation combines Lean compiler checks, back-translation, and NLI, with a focus on project-wide formalization potential rather than single-theorem translation. The work provides a scalable, extensible framework and openly releases the Herald dataset and translator to accelerate progress in mathematical formalization and LLM-based theorem proving.

Abstract

Verifiable formal languages like Lean have profoundly impacted mathematical reasoning, particularly through the use of large language models (LLMs) for automated reasoning. A significant challenge in training LLMs for these formal languages is the lack of parallel datasets that align natural language with formal language proofs. To address this challenge, this paper introduces a novel framework for translating the Mathlib4 corpus (a unified library of mathematics in formal language Lean 4) into natural language. Building upon this, we employ a dual augmentation strategy that combines tactic-based and informal-based approaches, leveraging the Lean-jixia system, a Lean 4 analyzer. We present the results of this pipeline on Mathlib4 as Herald (Hierarchy and Retrieval-based Translated Lean Dataset). We also propose the Herald Translator, which is fine-tuned on Herald. Herald translator achieves a 93.2% accuracy (Pass@128) on formalizing statements in the miniF2F-test and a 22.5% accuracy on our internal graduate-level textbook dataset, outperforming InternLM2-Math-Plus-7B (74.0% and 7.5%) and TheoremLlama (50.1% and 4.0%). Furthermore, we propose a section-level translation framework for real-world applications. As a direct application of Herald translator, we have successfully translated a template section in the Stack project, marking a notable progress in the automatic formalization of graduate-level mathematical literature. Our model, along with the datasets, are open-sourced to the public.

Paper Structure

This paper contains 62 sections, 29 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Overview of our approach. (a) NL-FL Dataset Generation: We extract statements from Mathlib4 and informalize them by providing the LLM with rich contextual information, particularly dependent theorems, and proceed in dependency-level order to ensure the LLM has access to all relevant natural language annotations. The same pipeline is applied to the proof corpus, aided by the NL-FL statement dataset generated in the previous step. Additionally, we augment the statement corpus in two ways: by breaking down tactic-wise proofs in Mathlib4, with results validated through the Lean compiler, and by using LLMs to generate equivalent NL statements. (b) Autoformalization Pipeline: We train a statement formalizer on the Herald dataset. During formalization, FL statements are first generated by the Herald translator and then fed into a powerful automatic theorem prover (e.g., DeepSeek Prover V1.5) to obtain the final formalized corpus.
  • Figure 2: Illustration of how related instances are retrieved: NL-FL statement examples are embedded and stored in a vector database. The statement being informalized is treated as a query and embedded by the same model, which is designed to account for mathematical similarity. The vector database then retrieves a list of relevant theorems based on cosine similarity of the embeddings.
  • Figure 3: Demonstration of Tactic Augmentation Strategy
  • Figure 4: Demonstration of LLM Informal Augmentation
  • Figure 5: Term-based Proof
  • ...and 3 more figures