Table of Contents
Fetching ...

Improving Autoformalization Using Direct Dependency Retrieval

Shaoqi Wang, Lu Yu, Siwei Lou, Feng Yan, Chunjie Yang, Qing Cui, Jun Zhou

TL;DR

The paper tackles autoformalization challenges by introducing Direct Dependency Retrieval (DDR), a generation-plus-verification framework that directly proposes formal library dependencies and verifies their existence via a suffix-array-based check. DDR is trained with a SAC pipeline on a large 500k-sample corpus to produce high-quality dependency labels, enabling scalable retrieval that reduces hallucinations and improves precision and recall. Empirical results show DDR outperforms state-of-the-art retrieval methods and enhances downstream autoformalization metrics BEq and TypeC, with strong stability across multiple attempts and compatibility as a plug-in with existing autoformalizers. The work also provides a large open dataset of library dependencies and introduces metrics for assessing autoformalization difficulty, contributing to scalable, BEq-aligned evaluation in formal mathematics.

Abstract

The convergence of deep learning and formal mathematics has spurred research in formal verification. Statement autoformalization, a crucial first step in this process, aims to translate informal descriptions into machine-verifiable representations but remains a significant challenge. The core difficulty lies in the fact that existing methods often suffer from a lack of contextual awareness, leading to hallucination of formal definitions and theorems. Furthermore, current retrieval-augmented approaches exhibit poor precision and recall for formal library dependency retrieval, and lack the scalability to effectively leverage ever-growing public datasets. To bridge this gap, we propose a novel retrieval-augmented framework based on DDR (\textit{Direct Dependency Retrieval}) for statement autoformalization. Our DDR method directly generates candidate library dependencies from natural language mathematical descriptions and subsequently verifies their existence within the formal library via an efficient suffix array check. Leveraging this efficient search mechanism, we constructed a dependency retrieval dataset of over 500,000 samples and fine-tuned a high-precision DDR model. Experimental results demonstrate that our DDR model significantly outperforms SOTA methods in both retrieval precision and recall. Consequently, an autoformalizer equipped with DDR shows consistent performance advantages in both single-attempt accuracy and multi-attempt stability compared to models using traditional selection-based RAG methods.

Improving Autoformalization Using Direct Dependency Retrieval

TL;DR

The paper tackles autoformalization challenges by introducing Direct Dependency Retrieval (DDR), a generation-plus-verification framework that directly proposes formal library dependencies and verifies their existence via a suffix-array-based check. DDR is trained with a SAC pipeline on a large 500k-sample corpus to produce high-quality dependency labels, enabling scalable retrieval that reduces hallucinations and improves precision and recall. Empirical results show DDR outperforms state-of-the-art retrieval methods and enhances downstream autoformalization metrics BEq and TypeC, with strong stability across multiple attempts and compatibility as a plug-in with existing autoformalizers. The work also provides a large open dataset of library dependencies and introduces metrics for assessing autoformalization difficulty, contributing to scalable, BEq-aligned evaluation in formal mathematics.

Abstract

The convergence of deep learning and formal mathematics has spurred research in formal verification. Statement autoformalization, a crucial first step in this process, aims to translate informal descriptions into machine-verifiable representations but remains a significant challenge. The core difficulty lies in the fact that existing methods often suffer from a lack of contextual awareness, leading to hallucination of formal definitions and theorems. Furthermore, current retrieval-augmented approaches exhibit poor precision and recall for formal library dependency retrieval, and lack the scalability to effectively leverage ever-growing public datasets. To bridge this gap, we propose a novel retrieval-augmented framework based on DDR (\textit{Direct Dependency Retrieval}) for statement autoformalization. Our DDR method directly generates candidate library dependencies from natural language mathematical descriptions and subsequently verifies their existence within the formal library via an efficient suffix array check. Leveraging this efficient search mechanism, we constructed a dependency retrieval dataset of over 500,000 samples and fine-tuned a high-precision DDR model. Experimental results demonstrate that our DDR model significantly outperforms SOTA methods in both retrieval precision and recall. Consequently, an autoformalizer equipped with DDR shows consistent performance advantages in both single-attempt accuracy and multi-attempt stability compared to models using traditional selection-based RAG methods.

Paper Structure

This paper contains 21 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Comparison of two paradigms for formal dependency retrieval. (a) Select-based retrieval using embedding cosine similarity: both the input statement and library items are encoded into embeddings, and the top-k items with highest cosine similarity are retrieved as dependencies. (b) DDR generates potential dependencies based on context, followed by SAC verification step.
  • Figure 2: Schematic diagram of SAC process. All library items are concatenated into a single string with delimiters, followed by suffix array construction. Each pending item is then queried via binary search to determine its match status (exact, partial, or none). This mechanism supports both dataset construction and DDR output verification.
  • Figure 3: LLM consistency across multiple attempts. Subplots (a–e) correspond to the performance of DeepSeek-R1 on datasets Diff01, Diff23, Diff45, Diff67, and Diff89, respectively. Each subplot illustrates the proportion of problems ($y$ axis) for which at least k (from 1 to 8, $x$ axis) out of 8 attempts successfully pass the BEq verification. Blue: DDR; Orange: ICL; Green: N/A.
  • Figure 4: LLM efficiency across multiple attempts. Subplots (a–e) correspond to the performance of DeepSeek-R1 on datasets Diff01, Diff23, Diff45, Diff67, and Diff89, respectively. Each subplot illustrates the intermediate pass@k score ($y$ axis) out of 8 attempts using BEq verification, k range from 1 to 8 ($x$ axis). Blue: DDR; Orange: ICL; Green: N/A.
  • Figure 5: LLM consistency across multiple attempts. From left to right correspond to Diff01, Diff23, Diff45, Diff67, and Diff89; from top to bottom correspond to Claude-3.5-Sonnet, DeepSeek-R1, GPT-4o, Ling-flash-2.0, and Qwen-Max. Each subplot illustrates the proportion of problems ($y$ axis) for which at least k (from 1 to 8, $x$ axis) out of 8 attempts successfully pass the BEq verification. Blue: DDR; Orange: ICL; Green: N/A.
  • ...and 2 more figures