Table of Contents
Fetching ...

Aria: An Agent For Retrieval and Iterative Auto-Formalization via Dependency Graph

Hanyu Wang, Ruohan Xie, Yutong Wang, Guoxiong Gao, Xintao Yu, Bin Dong

TL;DR

Aria addresses a core bottleneck in auto-formalization of conjecture-level mathematics by integrating retrieval-augmented generation, a Graph-of-Thought planning pipeline, and a compiler-guided self-reflection loop. It introduces AriaScorer, a term-grounded semantic checker that retrieves Mathlib definitions to verify that Lean formalizations faithfully reflect informal statements. Through a GoT-driven decomposition and bottom-up synthesis, Aria achieves state-of-the-art final accuracy across ProofNet, FATE-H, FATE-X, and homological conjectures, significantly outperforming baselines that rely on single-step generation or surface-text checks. Ablation studies show the necessity of all components—RAG, GoT, and reflection—for high performance, especially on conjecture-level problems—highlighting the practical potential of automated, reliable formalization to support automated proof and mathematics research.

Abstract

Accurate auto-formalization of theorem statements is essential for advancing automated discovery and verification of research-level mathematics, yet remains a major bottleneck for LLMs due to hallucinations, semantic mismatches, and their inability to synthesize new definitions. To tackle these issues, we present Aria (Agent for Retrieval and Iterative Autoformalization), a system for conjecture-level formalization in Lean that emulates human expert reasoning via a two-phase Graph-of-Thought process: recursively decomposing statements into a dependency graph and then constructing formalizations from grounded concepts. To ensure semantic correctness, we introduce AriaScorer, a checker that retrieves definitions from Mathlib for term-level grounding, enabling rigorous and reliable verification. We evaluate Aria on diverse benchmarks. On ProofNet, it achieves 91.6% compilation success rate and 68.5% final accuracy, surpassing previous methods. On FATE-X, a suite of challenging algebra problems from research literature, it outperforms the best baseline with 44.0% vs. 24.0% final accuracy. On a dataset of homological conjectures, Aria reaches 42.9% final accuracy while all other models score 0%.

Aria: An Agent For Retrieval and Iterative Auto-Formalization via Dependency Graph

TL;DR

Aria addresses a core bottleneck in auto-formalization of conjecture-level mathematics by integrating retrieval-augmented generation, a Graph-of-Thought planning pipeline, and a compiler-guided self-reflection loop. It introduces AriaScorer, a term-grounded semantic checker that retrieves Mathlib definitions to verify that Lean formalizations faithfully reflect informal statements. Through a GoT-driven decomposition and bottom-up synthesis, Aria achieves state-of-the-art final accuracy across ProofNet, FATE-H, FATE-X, and homological conjectures, significantly outperforming baselines that rely on single-step generation or surface-text checks. Ablation studies show the necessity of all components—RAG, GoT, and reflection—for high performance, especially on conjecture-level problems—highlighting the practical potential of automated, reliable formalization to support automated proof and mathematics research.

Abstract

Accurate auto-formalization of theorem statements is essential for advancing automated discovery and verification of research-level mathematics, yet remains a major bottleneck for LLMs due to hallucinations, semantic mismatches, and their inability to synthesize new definitions. To tackle these issues, we present Aria (Agent for Retrieval and Iterative Autoformalization), a system for conjecture-level formalization in Lean that emulates human expert reasoning via a two-phase Graph-of-Thought process: recursively decomposing statements into a dependency graph and then constructing formalizations from grounded concepts. To ensure semantic correctness, we introduce AriaScorer, a checker that retrieves definitions from Mathlib for term-level grounding, enabling rigorous and reliable verification. We evaluate Aria on diverse benchmarks. On ProofNet, it achieves 91.6% compilation success rate and 68.5% final accuracy, surpassing previous methods. On FATE-X, a suite of challenging algebra problems from research literature, it outperforms the best baseline with 44.0% vs. 24.0% final accuracy. On a dataset of homological conjectures, Aria reaches 42.9% final accuracy while all other models score 0%.

Paper Structure

This paper contains 37 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The overall pipeline of Aria system. (A) Graph-of-Thought Decomposition: Aria expands the informal statement into a dependency graph of concepts and grounds them in Mathlib. (B) Graph-of-Thought Synthesis: The system executes a bottom-up synthesis procedure guided by the graph, incorporating a self-reflection loop. (C) AriaScorer: A dedicated module that verifies the semantic correctness between the generated formal statement and the original informal statement.
  • Figure 2: The overall pipeline of AriaScorer: informal statements are decomposed into subtasks, grounded with retrieved Lean terms, and their evaluations are aggregated into a final score, which is compared against a threshold $\alpha \in [0,1]$ to yield a binary decision.
  • Figure 3: Dependency Graph of Koethe's Conjecture
  • Figure 4: Dependency Graph of Existence of Balanced Big Cohen–Macaulay Modules Conjecture