Step-Wise Formal Verification for LLM-Based Mathematical Problem Solving
Kuo Zhou, Lu Zhang
TL;DR
This paper tackles the challenge of verifying and improving LLM-based mathematical problem solving. It introduces MATH-VF, a training-free framework comprising a Formalizer that maps informal solutions into a formal language ($SimpleMath$) and a Critic that uses external tools (SymPy and Z3) to verify each step and provide corrective feedback. A key insight is the sparsity of the Solution Graph, which reduces the number of premises the Critic must consider, enabling efficient, accurate verification and refinement. Empirical results on MATH500 and ProcessBench show that MATH-VF outperforms other training-free baselines in verification and refinement, offering greater stability across problem difficulty and compatibility with closed-source models, while remaining distinct from fully training-based methods in its approach and requirements.
Abstract
Large Language Models (LLMs) have demonstrated formidable capabilities in solving mathematical problems, yet they may still commit logical reasoning and computational errors during the problem-solving process. Thus, this paper proposes a framework, MATH-VF, which includes a Formalizer and a Critic, for formally verifying the correctness of the solutions generated by large language models. Our framework first utilizes a Formalizer which employs an LLM to translate a natural language solution into a formal context. Afterward, our Critic (which integrates various external tools such as a Computer Algebra System and an SMT solver) evaluates the correctness of each statement within the formal context, and when a statement is incorrect, our Critic provides corrective feedback. We empirically investigate the effectiveness of MATH-VF in two scenarios: 1) Verification: MATH-VF is utilized to determine the correctness of a solution to a given problem. 2) Refinement: When MATH-VF identifies errors in the solution generated by an LLM-based solution generator for a given problem, it submits the corrective suggestions proposed by the Critic to the solution generator to regenerate the solution. We evaluate our framework on widely used mathematical benchmarks: MATH500 and ProcessBench, demonstrating the superiority of our approach over existing approaches.
