Table of Contents
Fetching ...

Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

Hayden Moore, Asfahan Shah

TL;DR

This work probes the robustness of autoformalization systems to semantically equivalent paraphrases of natural language mathematical statements, using two benchmarks (MiniF2F Isabelle/HOL and ProofNet Lean 4) and two modern LLMs (GPT-4o-mini and Claude-3.7-sonnet). It introduces a two-stage evaluation that analyzes semantic fidelity via SBERT-based cosine similarity and BLEU, and syntactic validity via Pass@K, revealing that paraphrasing can meaningfully shift both semantic and compilation outcomes depending on the model pairing. The findings show high semantic preservation across paraphrases but significant variability in formalization success, highlighting model- and source-dependent sensitivity and underscoring the need for more robust, explainable autoformalization pipelines. By extending the paraphrase robustness lens to formal verification, the work echoes similar concerns in text-to-SQL and has practical implications for deploying machine-assisted formal reasoning in real-world settings.

Abstract

Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL, has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved (Safarzadeh, Oroojlooyjadid, and Roth 2025). In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F (Zheng, Han, and Polu 2021) and Lean 4 version of ProofNet (Xin et al. 2024), and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.

Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

TL;DR

This work probes the robustness of autoformalization systems to semantically equivalent paraphrases of natural language mathematical statements, using two benchmarks (MiniF2F Isabelle/HOL and ProofNet Lean 4) and two modern LLMs (GPT-4o-mini and Claude-3.7-sonnet). It introduces a two-stage evaluation that analyzes semantic fidelity via SBERT-based cosine similarity and BLEU, and syntactic validity via Pass@K, revealing that paraphrasing can meaningfully shift both semantic and compilation outcomes depending on the model pairing. The findings show high semantic preservation across paraphrases but significant variability in formalization success, highlighting model- and source-dependent sensitivity and underscoring the need for more robust, explainable autoformalization pipelines. By extending the paraphrase robustness lens to formal verification, the work echoes similar concerns in text-to-SQL and has practical implications for deploying machine-assisted formal reasoning in real-world settings.

Abstract

Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL, has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved (Safarzadeh, Oroojlooyjadid, and Roth 2025). In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F (Zheng, Han, and Polu 2021) and Lean 4 version of ProofNet (Xin et al. 2024), and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.

Paper Structure

This paper contains 21 sections, 4 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of the autoformalization robustness evaluation pipeline for MiniF2F(Isabelle/HOL) and ProofNet(Lean 4). Each formal system undergoes both forward (Formal→NL) and reverse (NL→Formal) paraphrasing stages, using both GPT-4o-mini gpt and Claude-3.7-sonnet claude, with consistency evaluated via similarity metrics and Pass@K accuracy.
  • Figure 2: MiniF2F (Isabelle): Panels show sentence length, lexical diversity, and SBERT semantic similarity for GPT-4o-mini and Claude-3.7 paraphrasings.
  • Figure 3: ProofNet (Lean 4): Panels show sentence length, lexical diversity, and SBERT semantic similarity for GPT-4o-mini and Claude-3.7 paraphrasings.
  • Figure 4: MiniF2F (Isabelle) autoformalization cross-evaluation results. Pass@K accuracy is reported for $K\!\in\!\{1,2,5,10\}$ across semantic (BLEU) and syntactic (Compilation) metrics. Results are shown for both formalization models: Claude-3.7-Sonnet and GPT-4o-mini.
  • Figure 5: ProofNet (Lean 4) autoformalization cross-evaluation results. Pass@K accuracy is reported for $K\!\in\!\{1,2,5,10\}$ across semantic (BLEU) and syntactic (Compilation) metrics. Results are shown for both formalization models: Claude-3.7-Sonnet and GPT-4o-mini.