Towards Autoformalization of LLM-generated Outputs for Requirement Verification
Mihir Gupte, Ramesh S
TL;DR
This work investigates autoformalization as a means to verify LLM-generated outputs against natural language requirements by translating NL statements into Lean propositions and using a theorem prover to assess logical equivalence. Through two experiments, the authors demonstrate that logically equivalent NL requirements can be identified, and that LLM-generated outputs can be formally checked for consistency with original requirements, highlighting both potential and limitations. The study lays groundwork for a formal verification framework to improve fidelity and trustworthiness of LLM-powered applications, with emphasis on variable grounding and model interpretability as critical factors. While preliminary, the approach points to scalable methods for ensuring logical correctness in constrained domains and informs future research directions in autoformalization and formal verification for AI systems.
Abstract
Autoformalization, the process of translating informal statements into formal logic, has gained renewed interest with the emergence of powerful Large Language Models (LLMs). While LLMs show promise in generating structured outputs from natural language (NL), such as Gherkin Scenarios from NL feature requirements, there's currently no formal method to verify if these outputs are accurate. This paper takes a preliminary step toward addressing this gap by exploring the use of a simple LLM-based autoformalizer to verify LLM-generated outputs against a small set of natural language requirements. We conducted two distinct experiments. In the first one, the autoformalizer successfully identified that two differently-worded NL requirements were logically equivalent, demonstrating the pipeline's potential for consistency checks. In the second, the autoformalizer was used to identify a logical inconsistency between a given NL requirement and an LLM-generated output, highlighting its utility as a formal verification tool. Our findings, while limited, suggest that autoformalization holds significant potential for ensuring the fidelity and logical consistency of LLM-generated outputs, laying a crucial foundation for future, more extensive studies into this novel application.
