Table of Contents
Fetching ...

MiniF2F in Rocq: Automatic Translation Between Proof Assistants -- A Case Study

Jules Viennot, Guillaume Baudart, Emilio Jesùs Gallego Arias, Marc Lelarge

TL;DR

This study investigates translating MiniF2F theorems from Lean and Isabelle to Rocq using state-of-the-art LLMs, addressing cross-system compatibility and benchmarking challenges. A three-stage cascade—one-shot prompting, multi-turn with error feedback, and refined prompting with extended attempts—paired with automatic Petanque verification and expert audits, achieves a translation rate of $\frac{478}{488}$ ($98\%$) with only $2\%$ errors. The results show early-stage capabilities exist, but iterative interaction and targeted prompting yield the largest gains, and human verification remains essential for faithful semantics. Overall, the work demonstrates the viability of LLM-driven cross-proof-language translation and offers a scalable path for FAIR cross-system evaluation and dataset portability.

Abstract

In this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in 3 stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1 mini, and o1. We successfully translated 478 out of 488 theorems. The dataset is opensource: https://github.com/LLM4Rocq/miniF2F-rocq.

MiniF2F in Rocq: Automatic Translation Between Proof Assistants -- A Case Study

TL;DR

This study investigates translating MiniF2F theorems from Lean and Isabelle to Rocq using state-of-the-art LLMs, addressing cross-system compatibility and benchmarking challenges. A three-stage cascade—one-shot prompting, multi-turn with error feedback, and refined prompting with extended attempts—paired with automatic Petanque verification and expert audits, achieves a translation rate of () with only errors. The results show early-stage capabilities exist, but iterative interaction and targeted prompting yield the largest gains, and human verification remains essential for faithful semantics. Overall, the work demonstrates the viability of LLM-driven cross-proof-language translation and offers a scalable path for FAIR cross-system evaluation and dataset portability.

Abstract

In this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in 3 stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1 mini, and o1. We successfully translated 478 out of 488 theorems. The dataset is opensource: https://github.com/LLM4Rocq/miniF2F-rocq.

Paper Structure

This paper contains 19 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Cumulative translation results for MiniF2F to Rocq across all experimental stages. Numbers in parentheses indicate the maximum number of attempts allowed per theorem in multi-turn stages.
  • Figure 2: pass@1
  • Figure 3: pass@3
  • Figure 5: Effect of input information on translation faithfulness. Red bars represent errors, dotted bars indicate faithfulness errors, and white bars show valid translations.