A Lean Dataset for International Math Olympiad: Small Steps towards Writing Math Proofs for Hard Problems
Roozbeh Yousefzadeh, Xuenan Cao, Azim Ospanov
TL;DR
This work tackles the scarcity of publicly accessible formal proofs for International Math Olympiad problems and the challenge of generating Lean proofs with AI. It provides complete original Lean proofs for remaining miniF2F problems and constructs a lemma-based dataset by decomposing proofs into approachable building blocks to diagnose model capabilities. Evaluations across state-of-the-art provers reveal that short, line-based proofs are easier for existing systems, while longer, plan-driven proofs require more advanced reasoning. Public release of proofs, lemmas, and code supports robust benchmarking and helps steer research toward generalizable formal-proof generation in Lean.
Abstract
Using AI to write formal proofs for mathematical problems is a challenging task that has seen some advancements in recent years. Automated systems such as Lean can verify the correctness of proofs written in formal language, yet writing the proofs in formal language can be challenging for humans and machines. The miniF2F benchmark has 20 IMO problems in its test set, yet formal proofs are available only for 6 of these problems (3 of which are only written by mathematicians). The model with best accuracy can only prove 2 of these 20 IMO problems, from 1950s and 60s, while its training set is a secret. In this work, we write complete, original formal proofs for the remaining IMO problems in Lean along with 3 extra problems from IMO 2022 and 2023. This effort expands the availability of proof currently in the public domain by creating 5,880 lines of Lean proof. The goal of the paper is to pave the way for developing AI models that can automatically write the formal proofs for all the IMO problems in miniF2F and beyond by providing an evaluation benchmark. In this pursuit, we devise a method to decompose the proofs of these problems into their building blocks, constructing a dataset of 1,329 lemmas with more than 40k lines of Lean code. These lemmas are not trivial, yet they are approachable, providing the opportunity to evaluate and diagnose the failures and successes of AI models. We evaluate the ability of the SOTA LLMs on our dataset and analyze their success and failure modes from different perspectives. Our dataset and code is available at: https://github.com/roozbeh-yz/IMO-Steps.
