IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch
Param Biyani, Shashank Kirtania, Yasharth Bajpai, Sumit Gulwani, Ashish Tiwari
TL;DR
IndiMathBench tackles the autoformalization of Olympiad-style mathematics by building a human-verified Lean 4 benchmark from Indian Olympiads and deploying a scalable human-AI pipeline. The approach combines category-based retrieval, iterative error feedback, and multi-model ensembles, all validated via an interactive dashboard and expert review. Experimental results show that while compilation rates improve with human-guided context and feedback, semantic correctness remains challenging, especially in geometry-heavy problems, underscoring the gap between syntactic validity and true formal understanding. The work provides both a rich benchmark and tooling (dashboard and VS Code extension) to accelerate progress in neural theorem proving and formal mathematics.
Abstract
We introduce IndiMathBench, a human-verified benchmark designed to evaluate mathematical theorem proving, curated using an AI-powered human-assisted pipeline for formalizing natural language problems in Lean. IndiMathBench is composed of 312 formal Lean 4 theorems paired with their corresponding informal problem statements, sourced from Indian Mathematics Olympiads. Through category-based retrieval, iterative compiler feedback, and multi-model ensembles, our pipeline generates candidate formalizations that experts efficiently validate via an interactive dashboard with automated quality summaries. Evaluation across multiple frontier models demonstrates that autoformalization remains challenging, with substantial gaps between syntactic validity and semantic correctness, while theorem proving success rates remain low even with iterative refinement, demonstrating that \benchmark~presents a challenging testbed for mathematical reasoning. IndiMathBench is available at https://github.com/prmbiy/IndiMathBench.
