Semantic Search over 9 Million Mathematical Theorems
Luke Alexander, Eric Leonen, Sophie Szeto, Artemii Remizov, Ignacio Tejeda, Giovanni Inchiostro, Vasily Ilin
TL;DR
The paper tackles the problem of theorem-level retrieval in mathematics by constructing a large-scale corpus of over $9.2$ million theorem statements and representing each with natural-language slogans generated by an LLM. By embedding slogans with a dense embedding model and performing a two-stage retrieval (initial HNSW-based candidate retrieval followed by cosine reranking), the approach demonstrates strong theorem-level and paper-level retrieval on a curated set of professional-mathematician queries, outperforming baseline tools and even leading LLMs with web access. Key contributions include the large unified theorem corpus, systematic ablations of context and slogan-generation strategies, and the demonstration that slogan-based representations substantially improve retrieval performance over raw LaTeX embeddings. The authors release a public demo and dataset, enabling theorem-level search for researchers and AI agents, with potential applications in objective literature review, premise selection for formal proofs, and retrieval-augmented generation.
Abstract
Searching for mathematical results remains difficult: most existing tools retrieve entire papers, while mathematicians and theorem-proving agents often seek a specific theorem, lemma, or proposition that answers a query. While semantic search has seen rapid progress, its behavior on large, highly technical corpora such as research-level mathematical theorems remains poorly understood. In this work, we introduce and study semantic theorem retrieval at scale over a unified corpus of $9.2$ million theorem statements extracted from arXiv and seven other sources, representing the largest publicly available corpus of human-authored, research-level theorems. We represent each theorem with a short natural-language description as a retrieval representation and systematically analyze how representation context, language model choice, embedding model, and prompting strategy affect retrieval quality. On a curated evaluation set of theorem-search queries written by professional mathematicians, our approach substantially improves both theorem-level and paper-level retrieval compared to existing baselines, demonstrating that semantic theorem search is feasible and effective at web scale. The theorem search tool is available at \href{https://huggingface.co/spaces/uw-math-ai/theorem-search}{this link}, and the dataset is available at \href{https://huggingface.co/datasets/uw-math-ai/TheoremSearch}{this link}.
