Lyra: Orchestrating Dual Correction in Automated Theorem Proving

Chuanyang Zheng; Haiming Wang; Enze Xie; Zhengying Liu; Jiankai Sun; Huajian Xin; Jianhao Shen; Zhenguo Li; Yu Li

Lyra: Orchestrating Dual Correction in Automated Theorem Proving

Chuanyang Zheng, Haiming Wang, Enze Xie, Zhengying Liu, Jiankai Sun, Huajian Xin, Jianhao Shen, Zhenguo Li, Yu Li

TL;DR

The paper presents Lyra, a dual-correction framework for automated theorem proving that mitigates LLM hallucinations by post-processing with predefined tools (Tool Correction) and refines conjectures through prover feedback (Conjecture Correction). Built atop a DSP-inspired pipeline, Lyra demonstrates state-of-the-art miniF2F performance with Isabelle, surpassing prior methods by notable margins. Ablation studies show that both TC and CC contribute to gains, with CC requiring more attempts to converge but offering additional refinement when paired with a stronger prover. A case study on IMO problems confirms practical benefits and environment-driven refinement potential for broader formal proving tasks.

Abstract

Large Language Models (LLMs) present an intriguing avenue for exploration in the field of formal theorem proving. Nevertheless, their full potential, particularly concerning the mitigation of hallucinations and refinement through prover error messages, remains an area that has yet to be thoroughly investigated. To enhance the effectiveness of LLMs in the field, we introduce the Lyra, a new framework that employs two distinct correction mechanisms: Tool Correction (TC) and Conjecture Correction (CC). To implement Tool Correction in the post-processing of formal proofs, we leverage prior knowledge to utilize predefined prover tools (e.g., Sledgehammer) for guiding the replacement of incorrect tools. Tool Correction significantly contributes to mitigating hallucinations, thereby improving the overall accuracy of the proof. In addition, we introduce Conjecture Correction, an error feedback mechanism designed to interact with prover to refine formal proof conjectures with prover error messages. Compared to the previous refinement framework, the proposed Conjecture Correction refines generation with instruction but does not collect paired (generation, error & refinement) prompts. Our method has achieved state-of-the-art (SOTA) performance on both miniF2F validation (48.0% -> 55.3%) and test (45.5% -> 51.2%). We also present 3 IMO problems solved by Lyra. We believe Tool Correction (post-process for hallucination mitigation) and Conjecture Correction (subgoal adjustment from interaction with environment) could provide a promising avenue for future research in this field.

Lyra: Orchestrating Dual Correction in Automated Theorem Proving

TL;DR

Abstract

Paper Structure (16 sections, 12 figures, 2 tables, 2 algorithms)

This paper contains 16 sections, 12 figures, 2 tables, 2 algorithms.

Introduction
Related Works
Method
Background: Pipeline of DSP
Tool Correction
Conjecture Correction
Experiment
Dataset
Main Results
Ablation Study
Case Study
Conclusion
Baseline
IMO 1974 p5 Case
IMO 1959 p1 Case
...and 1 more sections

Figures (12)

Figure 1: Our proposed Lyra framework contains two modules. Tool Correction: employ the predefined tools to replace the incorrect tools and prove the conjectures. The prover fails because LLM wrongly believes that $\texttt{by\ (simp\ add:\ div\_mult\_mod\_eq)}$ can prove $x=19*(x~div~19)+4$. Actually, the conjecture is correct and simple, and the prover fails to prove it because it employs an incorrect tool. Hence, the prover successfully proves the conjecture when employing $\texttt{by~arith}$. Conjecture Correction: We design an interaction framework that integrates previous formal sketch and prover error messages to better sketch generation. The steps with the ATPWithTC delimiters are generated by an automated prover with Tool Correction.
Figure 2: Number of problems solved on miniF2F against the number of autoformalization attempts per problem. On miniF2F validation and test set, we have shown the results of Tool Correction (TC) and Conjecture Correction (CC) on human informal proof and GPT-4 informal proof respectively.
Figure 3: A successful formal proof synthesized with human informal proof. With Tool Correction and Conjecture Correction, we successfully solve an IMO problem IMO_1974_p5. The steps with the ATPWithTC delimiters are generated by an automated prover with Tool Correction. We also solve IMO_1959_p1 with GPT-4 informal proof, which is shown in the Appendix.
Figure 4: IMO_1974_p5: first round.
Figure 5: IMO_1974_p5: second round.
...and 7 more figures

Lyra: Orchestrating Dual Correction in Automated Theorem Proving

TL;DR

Abstract

Lyra: Orchestrating Dual Correction in Automated Theorem Proving

Authors

TL;DR

Abstract

Table of Contents

Figures (12)