FANS -- Formal Answer Selection for Natural Language Math Reasoning Using Lean4
Jiarui Yao, Ruida Wang, Tong Zhang
TL;DR
This work presents FANS, a Lean4-based framework that endows natural-language math reasoning with a formal, verifiable backbone by translating NL statements into Lean4, generating and verifying formal proofs, and using verification outcomes to guide answer selection. The approach combines a Long CoT NL→FL translator, Lean4 provers, and a verifier, with fallback mechanisms to MV or reward-model-based baselines when proofs fail. Empirical results on MATH500 and AMC23 show consistent improvements across base models, with notable gains in algebra and number theory and substantial benefits for weaker models on AMC23. The study demonstrates that NL solutions can be grounded in formal mathematics, enhancing trustworthiness and interpretability, and it lays out future directions for translator robustness and broader library support.
Abstract
Large Language Models (LLMs) have displayed astonishing abilities in various tasks, especially in text generation, classification, question answering, etc. However, the reasoning ability of LLMs still faces many debates. The inherent ambiguity of Natural Language (NL) limits LLMs' ability to perform verifiable reasoning, making its answers lack coherence and trustworthy support. To tackle the above problems, we propose a novel framework named FANS: Formal ANswer Selection for Natural Language Math Reasoning Using Lean4. To the best of our knowledge, it is the first framework that utilizes Lean4 to enhance LLMs' NL math reasoning ability. In particular, given an NL math question and LLM-generated answers, FANS first translates it into Lean4 theorem statements. Then it tries to prove it using a Lean4 prover and verify it by Lean4. Finally, it uses the FL result to assist in answer selection. It enhances LLMs' NL math ability in providing a computer-verifiable solution for its correct answer and proposes an alternative method for answer selection beyond the reward model. Extensive experiments indicate the effectiveness of our framework. It can improve the accuracy rate of reward model enhanced LLMs in the MATH-500 dataset by at most 1.91% and AMC-23 by at most 8.33% on strong reward-model baselines. In some particular fields like number theory that Lean4 experts in, we can even select all correct solutions. The qualitative analysis also shows our framework can make NL results formally backed by Lean4 proofs. As a pioneering work in the corresponding field, we will open-source all our models and datasets to further boost the development of the field.
