Table of Contents
Fetching ...

FANS -- Formal Answer Selection for Natural Language Math Reasoning Using Lean4

Jiarui Yao, Ruida Wang, Tong Zhang

TL;DR

This work presents FANS, a Lean4-based framework that endows natural-language math reasoning with a formal, verifiable backbone by translating NL statements into Lean4, generating and verifying formal proofs, and using verification outcomes to guide answer selection. The approach combines a Long CoT NL→FL translator, Lean4 provers, and a verifier, with fallback mechanisms to MV or reward-model-based baselines when proofs fail. Empirical results on MATH500 and AMC23 show consistent improvements across base models, with notable gains in algebra and number theory and substantial benefits for weaker models on AMC23. The study demonstrates that NL solutions can be grounded in formal mathematics, enhancing trustworthiness and interpretability, and it lays out future directions for translator robustness and broader library support.

Abstract

Large Language Models (LLMs) have displayed astonishing abilities in various tasks, especially in text generation, classification, question answering, etc. However, the reasoning ability of LLMs still faces many debates. The inherent ambiguity of Natural Language (NL) limits LLMs' ability to perform verifiable reasoning, making its answers lack coherence and trustworthy support. To tackle the above problems, we propose a novel framework named FANS: Formal ANswer Selection for Natural Language Math Reasoning Using Lean4. To the best of our knowledge, it is the first framework that utilizes Lean4 to enhance LLMs' NL math reasoning ability. In particular, given an NL math question and LLM-generated answers, FANS first translates it into Lean4 theorem statements. Then it tries to prove it using a Lean4 prover and verify it by Lean4. Finally, it uses the FL result to assist in answer selection. It enhances LLMs' NL math ability in providing a computer-verifiable solution for its correct answer and proposes an alternative method for answer selection beyond the reward model. Extensive experiments indicate the effectiveness of our framework. It can improve the accuracy rate of reward model enhanced LLMs in the MATH-500 dataset by at most 1.91% and AMC-23 by at most 8.33% on strong reward-model baselines. In some particular fields like number theory that Lean4 experts in, we can even select all correct solutions. The qualitative analysis also shows our framework can make NL results formally backed by Lean4 proofs. As a pioneering work in the corresponding field, we will open-source all our models and datasets to further boost the development of the field.

FANS -- Formal Answer Selection for Natural Language Math Reasoning Using Lean4

TL;DR

This work presents FANS, a Lean4-based framework that endows natural-language math reasoning with a formal, verifiable backbone by translating NL statements into Lean4, generating and verifying formal proofs, and using verification outcomes to guide answer selection. The approach combines a Long CoT NL→FL translator, Lean4 provers, and a verifier, with fallback mechanisms to MV or reward-model-based baselines when proofs fail. Empirical results on MATH500 and AMC23 show consistent improvements across base models, with notable gains in algebra and number theory and substantial benefits for weaker models on AMC23. The study demonstrates that NL solutions can be grounded in formal mathematics, enhancing trustworthiness and interpretability, and it lays out future directions for translator robustness and broader library support.

Abstract

Large Language Models (LLMs) have displayed astonishing abilities in various tasks, especially in text generation, classification, question answering, etc. However, the reasoning ability of LLMs still faces many debates. The inherent ambiguity of Natural Language (NL) limits LLMs' ability to perform verifiable reasoning, making its answers lack coherence and trustworthy support. To tackle the above problems, we propose a novel framework named FANS: Formal ANswer Selection for Natural Language Math Reasoning Using Lean4. To the best of our knowledge, it is the first framework that utilizes Lean4 to enhance LLMs' NL math reasoning ability. In particular, given an NL math question and LLM-generated answers, FANS first translates it into Lean4 theorem statements. Then it tries to prove it using a Lean4 prover and verify it by Lean4. Finally, it uses the FL result to assist in answer selection. It enhances LLMs' NL math ability in providing a computer-verifiable solution for its correct answer and proposes an alternative method for answer selection beyond the reward model. Extensive experiments indicate the effectiveness of our framework. It can improve the accuracy rate of reward model enhanced LLMs in the MATH-500 dataset by at most 1.91% and AMC-23 by at most 8.33% on strong reward-model baselines. In some particular fields like number theory that Lean4 experts in, we can even select all correct solutions. The qualitative analysis also shows our framework can make NL results formally backed by Lean4 proofs. As a pioneering work in the corresponding field, we will open-source all our models and datasets to further boost the development of the field.

Paper Structure

This paper contains 26 sections, 2 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Comparision between FANS and majority vote, together with ORM and ORM + FANS method. From the results, we could see FANS based on ORM achieve the highest accuracies consistently across different base models and different test sets. In particular, FANS performs well on the sub-fields of number theory and algebra, which are better supported by Lean4 with its existing libraries.
  • Figure 2: FANS Framework: The framework shown in the upper part first passes the Natural Language math questions and the LLM-generated answers to our Long CoT NL-to-FL translator. Subsequently, it invokes a prover to prove the translated Lean4 statements and uses the verifier to check whether the proofs are correct. The correct outputs are used for further answer selection as a verifiable foundation. Existing Methods: majority vote and best-of-N ranking based on reward models are shown in the lower part of the figure.
  • Figure 3: Formal language translation prompt without natural language alignment.
  • Figure 4: The prompt used for natural language statements alignment.
  • Figure 5: Example of the aligned natural language statement generated by the alignment model, given the original question and answer.
  • ...and 4 more figures