Table of Contents
Fetching ...

IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch

Param Biyani, Shashank Kirtania, Yasharth Bajpai, Sumit Gulwani, Ashish Tiwari

TL;DR

IndiMathBench tackles the autoformalization of Olympiad-style mathematics by building a human-verified Lean 4 benchmark from Indian Olympiads and deploying a scalable human-AI pipeline. The approach combines category-based retrieval, iterative error feedback, and multi-model ensembles, all validated via an interactive dashboard and expert review. Experimental results show that while compilation rates improve with human-guided context and feedback, semantic correctness remains challenging, especially in geometry-heavy problems, underscoring the gap between syntactic validity and true formal understanding. The work provides both a rich benchmark and tooling (dashboard and VS Code extension) to accelerate progress in neural theorem proving and formal mathematics.

Abstract

We introduce IndiMathBench, a human-verified benchmark designed to evaluate mathematical theorem proving, curated using an AI-powered human-assisted pipeline for formalizing natural language problems in Lean. IndiMathBench is composed of 312 formal Lean 4 theorems paired with their corresponding informal problem statements, sourced from Indian Mathematics Olympiads. Through category-based retrieval, iterative compiler feedback, and multi-model ensembles, our pipeline generates candidate formalizations that experts efficiently validate via an interactive dashboard with automated quality summaries. Evaluation across multiple frontier models demonstrates that autoformalization remains challenging, with substantial gaps between syntactic validity and semantic correctness, while theorem proving success rates remain low even with iterative refinement, demonstrating that \benchmark~presents a challenging testbed for mathematical reasoning. IndiMathBench is available at https://github.com/prmbiy/IndiMathBench.

IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch

TL;DR

IndiMathBench tackles the autoformalization of Olympiad-style mathematics by building a human-verified Lean 4 benchmark from Indian Olympiads and deploying a scalable human-AI pipeline. The approach combines category-based retrieval, iterative error feedback, and multi-model ensembles, all validated via an interactive dashboard and expert review. Experimental results show that while compilation rates improve with human-guided context and feedback, semantic correctness remains challenging, especially in geometry-heavy problems, underscoring the gap between syntactic validity and true formal understanding. The work provides both a rich benchmark and tooling (dashboard and VS Code extension) to accelerate progress in neural theorem proving and formal mathematics.

Abstract

We introduce IndiMathBench, a human-verified benchmark designed to evaluate mathematical theorem proving, curated using an AI-powered human-assisted pipeline for formalizing natural language problems in Lean. IndiMathBench is composed of 312 formal Lean 4 theorems paired with their corresponding informal problem statements, sourced from Indian Mathematics Olympiads. Through category-based retrieval, iterative compiler feedback, and multi-model ensembles, our pipeline generates candidate formalizations that experts efficiently validate via an interactive dashboard with automated quality summaries. Evaluation across multiple frontier models demonstrates that autoformalization remains challenging, with substantial gaps between syntactic validity and semantic correctness, while theorem proving success rates remain low even with iterative refinement, demonstrating that \benchmark~presents a challenging testbed for mathematical reasoning. IndiMathBench is available at https://github.com/prmbiy/IndiMathBench.

Paper Structure

This paper contains 53 sections, 2 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: A sample problem from INMO 2014 with its formalization in Lean 4
  • Figure 2: Overview of approach for creating IndicMath dataset: Human is assisted by a multiple LLM annotators. Each LLM generation is conditioned on the natural language and documentation, and goes through a validation check by Lean. Errors are provided as feedback in subsequent iterations. The final generations from all LLMs are summarized by an LLM in a dashboard to help optimize annotation efficiency.
  • Figure 3: The Annotation Dashboard for the human expert to analyze various ranked options, modify and validate the final entry.
  • Figure 4: Annotation time for 12 RMO problems under the three workflows
  • Figure 5: Venn diagram for BEq-passing problems for three different leading models.
  • ...and 3 more figures