Table of Contents
Fetching ...

Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization

Jin Peng Zhou, Charles Staats, Wenda Li, Christian Szegedy, Kilian Q. Weinberger, Yuhuai Wu

TL;DR

This work addresses the unreliability of LLMs in quantitative reasoning by grounding their reasoning in formal verification. The proposed Dont Trust: Verify (DTV) pipeline autoformalizes informal statements and solutions into Isabelle, then uses an automated theorem prover to verify correctness, filtering to only verified solutions. Across GSM8K, MATH (subset), and MultiArith, DTV yields consistent performance gains over vanilla majority voting without any model finetuning, and gains are amplified when formal statements are generated with stronger models. The approach highlights a practical, scalable path to improve reliability of LLM reasoning by leveraging formal theorem proving environments, while acknowledging current limitations in domain coverage and statement translation fidelity.

Abstract

Large language models (LLM), such as Google's Minerva and OpenAI's GPT families, are becoming increasingly capable of solving mathematical quantitative reasoning problems. However, they still make unjustified logical and computational errors in their reasoning steps and answers. In this paper, we leverage the fact that if the training corpus of LLMs contained sufficiently many examples of formal mathematics (e.g. in Isabelle, a formal theorem proving environment), they can be prompted to translate i.e. autoformalize informal mathematical statements into formal Isabelle code -- which can be verified automatically for internal consistency. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement. We evaluate our method on GSM8K, MATH and MultiArith datasets and demonstrate that our approach provides a consistently better heuristic than vanilla majority voting -- the previously best method to identify correct answers, by more than 12% on GSM8K. In our experiments it improves results consistently across all datasets and LLM model sizes. The code can be found at https://github.com/jinpz/dtv.

Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization

TL;DR

This work addresses the unreliability of LLMs in quantitative reasoning by grounding their reasoning in formal verification. The proposed Dont Trust: Verify (DTV) pipeline autoformalizes informal statements and solutions into Isabelle, then uses an automated theorem prover to verify correctness, filtering to only verified solutions. Across GSM8K, MATH (subset), and MultiArith, DTV yields consistent performance gains over vanilla majority voting without any model finetuning, and gains are amplified when formal statements are generated with stronger models. The approach highlights a practical, scalable path to improve reliability of LLM reasoning by leveraging formal theorem proving environments, while acknowledging current limitations in domain coverage and statement translation fidelity.

Abstract

Large language models (LLM), such as Google's Minerva and OpenAI's GPT families, are becoming increasingly capable of solving mathematical quantitative reasoning problems. However, they still make unjustified logical and computational errors in their reasoning steps and answers. In this paper, we leverage the fact that if the training corpus of LLMs contained sufficiently many examples of formal mathematics (e.g. in Isabelle, a formal theorem proving environment), they can be prompted to translate i.e. autoformalize informal mathematical statements into formal Isabelle code -- which can be verified automatically for internal consistency. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement. We evaluate our method on GSM8K, MATH and MultiArith datasets and demonstrate that our approach provides a consistently better heuristic than vanilla majority voting -- the previously best method to identify correct answers, by more than 12% on GSM8K. In our experiments it improves results consistently across all datasets and LLM model sizes. The code can be found at https://github.com/jinpz/dtv.
Paper Structure (21 sections, 6 equations, 18 figures, 5 tables)

This paper contains 21 sections, 6 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: A pictorial illustration of Don't Trust: Verify. Left: starting from an informal statement, an informal solution is generated by a large language model. The informal statement and solution are then translated into their formal counterparts. Multiple informal solutions for each problem are generated with temperature sampling. Right: An automated theorem prover in the formal environment is used to verify formal solutions against formal statements step by step (indicated by [ATP]). The final answer is chosen using majority voting over only the verified solutions. In this example, Formal Solution #1 successfully proves Formal Statement #1. Formal Solution #2, however, fails to prove the "only if" direction of Formal Statement #2.
  • Figure 2: Examples of faithful formal statements translated from informal statements with correct answers by DTV. Majority voting failed to solve all four problems but DTV solves them successfully. The model is capable of translating complex informal statements precisely. Note that the sentence "Show that it is [answer]" is appended to the original informal problem statement by first extracting the answer from an informal solution. We provide more examples in Appendix \ref{['sec:faithful_statement']}.
  • Figure 3: An example number theory problem of which both the informal statement and solution are translated correctly by DTV. The generated formal statement cannot be directly solved by automated theorem prover. By translating informal solution step by step, the formal statement is proved and the answer is verified. The steps that end with [ATP] are generated by the automated theorem prover.
  • Figure 4: Additional examples of faithful formal statements translated from informal statements with correct answers by DTV. Majority voting failed to solve these problems but DTV solves them successfully. The model is capable of translating complex informal statements precisely. Note that the "Show that it is [answer]" is appended to the original informal problem statement by first extracting the answer from informal solution.
  • Figure 5: Additional example problems of which both the informal statement and solution are translated correctly by DTV. The generated formal statement cannot be directly solved by an automated theorem prover. By translating the informal solution step by step, the formal statement is proved and the answer is verified. The steps that end with [ATP] are generated by the automated theorem prover.
  • ...and 13 more figures