Improving Lean4 Autoformalization via Cycle Consistency Fine-tuning

Arsen Shebzukhov

Abstract

Autoformalization - the automatic translation of natural language mathematical text into a formal proof language such as Lean4 - can help accelerate AI-assisted mathematical research, whether via proof verification or proof search. I fine-tune Qwen3.5-2B with LoRA for natural language to Lean4 formalization on FineLeanCorpus and consider three training regimes: supervised fine-tuning (SFT) with curriculum learning (difficulty 1 to 10), SFT without curriculum ordering, and reinforcement learning (RL) using group relative policy optimization (GRPO) with a cycle consistency reward. Cycle consistency measures how well the meaning of a statement is preserved through an NL $\to$ Lean4 $\to$ NL' loop, computed as the cosine similarity of off-the-shelf sentence embeddings. On an unseen subset of FineLeanCorpus (FLC) and on PutnamBench, RL substantially outperforms both SFT variants (mean cycle consistency 0.669 vs. 0.513 on FLC; 0.561 vs. 0.422 on PutnamBench), while increasing cross-entropy loss by only 0.011 nats, with minimal impact on formalization quality. Curriculum ordering provides no measurable benefit over shuffled training.
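The cycle consistency reward described above can be sketched in a few lines. The following is a minimal illustration assuming an off-the-shelf sentence-transformers encoder; the specific embedding model and the formalizer/informalizer helpers in the usage comment are hypothetical stand-ins, not the paper's actual implementation.

```python
# Sketch of a cycle-consistency reward: embed the original NL statement and its
# back-translation NL', then take their cosine similarity.
from sentence_transformers import SentenceTransformer, util

# Any off-the-shelf sentence encoder; this particular checkpoint is an assumption.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def cycle_consistency_reward(nl_statement: str, back_translation: str) -> float:
    """Cosine similarity between the original statement and its NL -> Lean4 -> NL' round trip."""
    emb = embedder.encode([nl_statement, back_translation], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Hypothetical usage for one GRPO rollout (formalizer / informalizer are illustrative names):
# lean_stmt = formalizer.generate(nl_statement)    # f_theta: NL -> Lean4
# nl_prime  = informalizer.generate(lean_stmt)     # g_phi:  Lean4 -> NL'
# reward    = cycle_consistency_reward(nl_statement, nl_prime)
```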

Figures (10)

  • Figure 1: Training pipeline. Phase 1: $f_\theta$ and $g_\phi$ are fine-tuned independently via SFT on FineLeanCorpus, on the NL$\to$Lean4 and Lean4$\to$NL directions respectively. Phase 2: $f_\theta$ is further fine-tuned via GRPO with $g_\phi$ frozen. The reward is the cosine similarity between the original statement $s$ and the back-translated $\text{NL}'$; both live in the same natural language space $\mathcal{S}$.
  • Figure 2: Mean cycle consistency by model and evaluation set. Error bars show $\pm 1$ std.
  • Figure 3: Score vs. difficulty (FLC held-out, 100 problems per difficulty level).
  • Figure 4: RL improvement over SFT curriculum by difficulty level (FLC held-out). The gain decreases monotonically from $0.208$ at difficulty 1 to $0.126$ at difficulty 7.
  • Figure 5: Cycle consistency by mathematical domain on FLC held-out. Error bars show $\pm 1$ std. Domains with $n < 10$ excluded.
  • ...and 5 more figures