MIRB: Mathematical Information Retrieval Benchmark
Haocheng Ju, Bin Dong
TL;DR
MIRB introduces a unified benchmark for Mathematical Information Retrieval, spanning four task types—Semantic Statement Retrieval, Question-Answer Retrieval, Premise Retrieval, and Formula Retrieval—across 12 datasets of diverse mathematical content and languages. It evaluates 13 retrieval models, including sparse and dense retrievers and proprietary systems, using dynamic corpus strategies and the $nDCG@10$ metric, and finds that reasoning-based tasks are significantly harder than semantic ones, with cross-encoder rerankers often harming performance. The study highlights the gap between current MIR models and the needs of formal-macromath tasks, especially in premise retrieval across formal languages, indicating a need for domain-specific training and data. MIRB aims to provide a comprehensive framework to spur progress in math-specific information retrieval and to guide the development of models better suited to mathematical queries and documents.
Abstract
Mathematical Information Retrieval (MIR) is the task of retrieving information from mathematical documents and plays a key role in various applications, including theorem search in mathematical libraries, answer retrieval on math forums, and premise selection in automated theorem proving. However, a unified benchmark for evaluating these diverse retrieval tasks has been lacking. In this paper, we introduce MIRB (Mathematical Information Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB includes four tasks: semantic statement retrieval, question-answer retrieval, premise retrieval, and formula retrieval, spanning a total of 12 datasets. We evaluate 13 retrieval models on this benchmark and analyze the challenges inherent to MIR. We hope that MIRB provides a comprehensive framework for evaluating MIR systems and helps advance the development of more effective retrieval models tailored to the mathematical domain.
