Autoformalizer with Tool Feedback

Qi Guo; Jianing Wang; Jianfei Zhang; Deyang Kong; Xiangzhou Huang; Xiangyu Xi; Wei Wang; Jingang Wang; Xunliang Cai; Shikun Zhang; Wei Ye

Autoformalizer with Tool Feedback

Qi Guo, Jianing Wang, Jianfei Zhang, Deyang Kong, Xiangzhou Huang, Xiangyu Xi, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye

TL;DR

Autoformalizer with Tool Feedback (ATF) addresses data scarcity in Automated Theorem Proving by integrating Lean $4$ compiler feedback for syntax and a multi-LLMs-as-judge setup for semantic consistency into an end-to-end formalization workflow. The approach uses a three-stage training pipeline—cold-start, expert iteration, and Direct Preference Optimization (DPO)—to teach the model how to use tool feedback effectively, achieving state-of-the-art results across FormalMath-Lite, ProverBench, and CombiBench benchmarks, with strong human corroboration. ATF demonstrates robust scaling properties at inference time and generalization to out-of-distribution problems, supported by ablations that show the critical role of tool feedback and staged training. The authors open-source Numina-ATF, a dataset of about 750K synthetic formal statements, to accelerate progress in autoformalization and automated theorem proving.

Abstract

Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Efforts in recent work shift from directly prompting large language models to training an end-to-end formalizer model from scratch, achieving remarkable advancements. However, existing formalizer still struggles to consistently generate valid statements that meet syntactic validity and semantic consistency. To address this issue, we propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process. By integrating Lean 4 compilers for syntax corrections and employing a multi-LLMs-as-judge approach for consistency validation, the model is able to adaptively refine generated statements according to the tool feedback, enhancing both syntactic validity and semantic consistency. The training of ATF involves a cold-start phase on synthetic tool-calling data, an expert iteration phase to improve formalization capabilities, and Direct Preference Optimization to alleviate ineffective revisions. Experimental results show that ATF markedly outperforms a range of baseline formalizer models, with its superior performance further validated by human evaluations. Subsequent analysis reveals that ATF demonstrates excellent inference scaling properties. Moreover, we open-source Numina-ATF, a dataset containing 750K synthetic formal statements to facilitate advancements in autoformalization and ATP research.

Autoformalizer with Tool Feedback

TL;DR

Abstract

Autoformalizer with Tool Feedback

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (7)