Table of Contents
Fetching ...

Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization

Willy Chan, Michael Souliman, Jakob Nordhagen, Brando Miranda, Elyas Obbad, Kai Fronsdal Sanmi Koyejo

TL;DR

This work tackles the data bottleneck in autoformalization by introducing backtranslation-based data generation with hand-crafted prompts to emphasize data quality over sheer volume. The authors present three data-generation variants and the AI4Math dataset to train LLMs for translating informal mathematics into Lean4 formalizations, showing strong performance on ProofNet with far fewer tokens than multilingual baselines like MMA. Key findings include that high-quality, few-shot prompts can surpass large, diverse datasets in token efficiency, and that incorporating proof-state information via line-by-line tactics yields substantial gains at a higher cost. Collectively, the approach demonstrates a practical path to reducing computational resources required for formalizing proofs, with potential impact on interactive theorem proving and formal verification.

Abstract

Autoformalization, the process of transforming informal mathematical language into formal specifications and proofs remains a difficult task for state-of-the-art (large) language models. Existing works point to competing explanations for the performance gap. To this end, we introduce a novel methodology that leverages back-translation with hand-curated prompts to enhance the mathematical capabilities of language models, particularly addressing the challenge posed by the scarcity of labeled data. Specifically, we evaluate three primary variations of this strategy: (1) on-the-fly (online) backtranslation, (2) distilled (offline) backtranslation with few-shot amplification, and (3) line-by-line proof analysis integrated with proof state information. Each variant is designed to optimize data quality over quantity, focusing on the high fidelity of generated proofs rather than sheer data scale. Our findings provide evidence that employing our proposed approaches to generate synthetic data, which prioritizes quality over volume, improves the Autoformalization performance of LLMs as measured by standard benchmarks such as ProofNet. Crucially, our approach outperforms pretrained models using a minimal number of tokens. We also show, through strategic prompting and backtranslation, that our approaches surpass the performance of fine-tuning with extensive multilingual datasets such as MMA on ProofNet with only 1/150th of the tokens. Taken together, our methods show a promising new approach to significantly reduce the resources required to formalize proofs, thereby accelerating AI for math.

Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization

TL;DR

This work tackles the data bottleneck in autoformalization by introducing backtranslation-based data generation with hand-crafted prompts to emphasize data quality over sheer volume. The authors present three data-generation variants and the AI4Math dataset to train LLMs for translating informal mathematics into Lean4 formalizations, showing strong performance on ProofNet with far fewer tokens than multilingual baselines like MMA. Key findings include that high-quality, few-shot prompts can surpass large, diverse datasets in token efficiency, and that incorporating proof-state information via line-by-line tactics yields substantial gains at a higher cost. Collectively, the approach demonstrates a practical path to reducing computational resources required for formalizing proofs, with potential impact on interactive theorem proving and formal verification.

Abstract

Autoformalization, the process of transforming informal mathematical language into formal specifications and proofs remains a difficult task for state-of-the-art (large) language models. Existing works point to competing explanations for the performance gap. To this end, we introduce a novel methodology that leverages back-translation with hand-curated prompts to enhance the mathematical capabilities of language models, particularly addressing the challenge posed by the scarcity of labeled data. Specifically, we evaluate three primary variations of this strategy: (1) on-the-fly (online) backtranslation, (2) distilled (offline) backtranslation with few-shot amplification, and (3) line-by-line proof analysis integrated with proof state information. Each variant is designed to optimize data quality over quantity, focusing on the high fidelity of generated proofs rather than sheer data scale. Our findings provide evidence that employing our proposed approaches to generate synthetic data, which prioritizes quality over volume, improves the Autoformalization performance of LLMs as measured by standard benchmarks such as ProofNet. Crucially, our approach outperforms pretrained models using a minimal number of tokens. We also show, through strategic prompting and backtranslation, that our approaches surpass the performance of fine-tuning with extensive multilingual datasets such as MMA on ProofNet with only 1/150th of the tokens. Taken together, our methods show a promising new approach to significantly reduce the resources required to formalize proofs, thereby accelerating AI for math.

Paper Structure

This paper contains 12 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Autoformalization Task Example. This table compares formal Lean statements with their equivalent informal mathematical statements. The formal statements represent theorems in Lean, while the informal statements convey the theorems in human-readable natural proof language.
  • Figure 3: Distilled Backtranslation Process Graph. Leveraging the MathLib and ProofNet datasets, a "Teacher" model translates formal theorems into synthetic informal language (IL) examples, which are then backtranslated to augment a training dataset with new formal language (FL) examples. A "Student" model, more compact than the teacher, is fine-tuned on this enriched dataset to improve its translation from IL to FL.
  • Figure 4: Regular Expressions can be used to generate a large volume of rudimentary informalizations. Certain Lean proof statements follow predictable patterns that can be captured via a regex. More sophisticated pattern-matching solutions can be built on this framework to increase the amount of paired data available to fine-tune on.
  • Figure 5: Formal-informal pairs used for training are generated via few-shot prompting and distilled backtranslation. Mined proofs from mathlib are used as input into this format string for prompting, and then GPT-4 creates examples that can be utilized for fine-tuning. In contrast to MMA, our method used six examples to informalize theorem statements.
  • Figure 6: Informalization. These strings format the parsed patterns into informal explanations.
  • ...and 1 more figures