Table of Contents
Fetching ...

ProofWala: Multilingual Proof Data Synthesis and Theorem-Proving

Amitayush Thakur, George Tsoukalas, Greg Durrett, Swarat Chaudhuri

TL;DR

ProofWala presents a multilingual, language-agnostic framework for data collection, model training, and parallel search in interactive theorem proving, uniting Lean and Coq under a standardized pipeline. By building multilingual proof-step data and fine-tuning transformer-based predictors, it demonstrates cross-language transfer and improved prove-at-$k$ performance versus monolingual baselines, with strong evidence after domain adaptation to CategoryTheory. The approach relies on a three-part architecture (interface, training, and parallel search) and leverages parallel Ray-based proof search to scale end-to-end synthesis. All code and multilingual datasets are open-sourced, offering a foundation for cross-domain theorem-proving research and future improvements in multilingual autoformalization and cross-ITP generalization.

Abstract

Neural networks have shown substantial promise at automatic theorem-proving in interactive proof assistants (ITPs) like Lean and Coq. However, most neural theorem-proving models are restricted to specific ITPs, leaving out opportunities for cross-lingual $\textit{transfer}$ between ITPs. We address this weakness with a multilingual proof framework, ${\rm P{\small ROOF}W{\small ALA}}$, that allows a standardized form of interaction between neural theorem-provers and two established ITPs (Coq and Lean). It enables the collection of multilingual proof step data -- data recording the result of proof actions on ITP states -- for training neural provers. ${\rm P{\small ROOF}W{\small ALA}}$ allows the systematic evaluation of a model's performance across different ITPs and problem domains via efficient parallel proof search algorithms. We show that multilingual training enabled by ${\rm P{\small ROOF}W{\small ALA}}$ can lead to successful transfer across ITPs. Specifically, a model trained on a mix of ${\rm P{\small ROOF}W{\small ALA}}$-generated Coq and Lean data outperforms Lean-only and Coq-only models on the standard prove-at-$k$ metric. We open source all code including code for the ${\rm P{\small ROOF}W{\small ALA}}$ Framework (https://github.com/trishullab/proof-wala), and the Multilingual ITP interaction framework (https://github.com/trishullab/itp-interface).

ProofWala: Multilingual Proof Data Synthesis and Theorem-Proving

TL;DR

ProofWala presents a multilingual, language-agnostic framework for data collection, model training, and parallel search in interactive theorem proving, uniting Lean and Coq under a standardized pipeline. By building multilingual proof-step data and fine-tuning transformer-based predictors, it demonstrates cross-language transfer and improved prove-at- performance versus monolingual baselines, with strong evidence after domain adaptation to CategoryTheory. The approach relies on a three-part architecture (interface, training, and parallel search) and leverages parallel Ray-based proof search to scale end-to-end synthesis. All code and multilingual datasets are open-sourced, offering a foundation for cross-domain theorem-proving research and future improvements in multilingual autoformalization and cross-ITP generalization.

Abstract

Neural networks have shown substantial promise at automatic theorem-proving in interactive proof assistants (ITPs) like Lean and Coq. However, most neural theorem-proving models are restricted to specific ITPs, leaving out opportunities for cross-lingual between ITPs. We address this weakness with a multilingual proof framework, , that allows a standardized form of interaction between neural theorem-provers and two established ITPs (Coq and Lean). It enables the collection of multilingual proof step data -- data recording the result of proof actions on ITP states -- for training neural provers. allows the systematic evaluation of a model's performance across different ITPs and problem domains via efficient parallel proof search algorithms. We show that multilingual training enabled by can lead to successful transfer across ITPs. Specifically, a model trained on a mix of -generated Coq and Lean data outperforms Lean-only and Coq-only models on the standard prove-at- metric. We open source all code including code for the Framework (https://github.com/trishullab/proof-wala), and the Multilingual ITP interaction framework (https://github.com/trishullab/itp-interface).

Paper Structure

This paper contains 27 sections, 1 equation, 12 figures, 8 tables.

Figures (12)

  • Figure 1: A Lean 4 theorem and a with a correct proof using ProofWala-Multilingual proof-step generation model. The theorem states that the standard basis matrix, where $c$ is placed in the $(i,j)th$ entry with zeroes elsewhere is block triangular. The first tactic rintro i' j' hij' unfolds the definition of BlockTriangular and adds the variables i', j', as well as the hypothesis hij' : b j' < b i' to the set of hypotheses. The proof proceeds by using established properties of the stdBasisMatrix and resolves by demonstrating an inconsistency with the hypothesis hij : b i $\leq$ b j.
  • Figure 2: The ProofWala Framework with the interaction between different modules. Using ProofWala's interaction & data-collection modules, we collect a multilingual proof dataset from existing formal mathematics repositories in Lean and Coq. The resulting dataset is used to train a multilingual proof step prediction model, supported by ProofWala's training module. The multilingual model is used inside ProofWala's search module to conduct proof search.
  • Figure 3: Distribution of degree of nodes in the proof trees across various data mixes found by different ProofWala models. Across all data mixes, ProofWala-Multilingual models tend to have higher degrees per node. This indicates that ProofWala-Multilingual often find more compilable tactics given a particular proof state, this can increase the chances of eventually finding a proof.
  • Figure 4: An excerpt from the extracted training data sequence, $\pi=\langle (O_0, a_1), \dots, (O_i, a_i), \dots (O_{n-1}, a_n)\rangle$ (see \ref{['sec:problem-formulation']}), for a given theorem in Coq and Lean 4. The training data extracted here is used to train ProofWala proof step generation models. Here, $O_i$ i.e. set of obligations is extracted under start_goals key while $O_{i+1}$ is represented under end_goals. The action $a_i$ is extracted as the value of proof_steps key. There are more fields other than the ones shown in the figure. (a) Shows an example of a Coq proof step, and (b) shows an example of a Lean proof step.
  • Figure 5: Prompt format for training the proof step generation model. (a) shows the prompt format for Coq , (b) shows the prompt format for Lean 4, (c) shows the response format used for Coq, and (d) shows the response format used for Lean 4. We adopted a format similar to the one used in COPRA thakur2024incontext but without any error context. It is important to note that we do not mention any information about the domain or ITP assistant in the prompt. The prompt format is the same for both languages.
  • ...and 7 more figures