RoMath: A Mathematical Reasoning Benchmark in Romanian
Adrian Cosma, Ana-Maria Bucur, Emilian Radoi
TL;DR
RoMath addresses the lack of Romanian mathematical reasoning benchmarks by introducing a three-subset suite (Baccalaureate, Competitions, Synthetic) totaling $76{,}910$ problems, together with a semi-automatic pipeline that extracts, structures, and annotates Romanian math content for evaluation. The authors propose an evaluation framework that combines exact-output checks for verifiable problems with LLM-based judging for proofs. They demonstrate the utility of open-weight LLMs, LoRA fine-tuning, and GRPO-style rewards, and examine the challenges of translating Romanian math statements into English. Key findings are that some English-centric models can produce Romanian math solutions, that translation degrades performance, and that training on verifiable data with rewards can boost results. Limitations include the reliance on judge-based evaluation and translation artifacts. The work provides a reproducible resource with open code and data, offering a baseline for future research on mathematical reasoning in multilingual and low-resource-language settings and highlighting the need for language-specific resources beyond simple translation.
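As an illustration of the two-track evaluation described above, the following minimal sketch routes verifiable problems to an exact-match check and proof problems to a judge. It is not the authors' implementation: the field names (`verifiable`, `statement`, `answer`), the `normalize` helper, and the `judge` callable are all hypothetical, and the judge is a stand-in for whatever LLM call would be used in practice.

```python
from typing import Callable

# Hypothetical judge signature: (statement, reference, prediction) -> bool.
Judge = Callable[[str, str, str], bool]


def normalize(answer: str) -> str:
    """Drop surrounding whitespace and '$', lowercase, and remove inner whitespace."""
    return "".join(answer.strip().strip("$").lower().split())


def score(problem: dict, prediction: str, judge: Judge) -> bool:
    """Score one prediction; the problem dict layout is assumed for illustration."""
    if problem["verifiable"]:
        # Verifiable problems have a single final answer: compare exactly
        # after light normalization.
        return normalize(prediction) == normalize(problem["answer"])
    # Proof-style problems have no single checkable output: defer to the judge.
    return judge(problem["statement"], problem["answer"], prediction)
```

In this sketch the judge is passed in as a plain function, so the exact-match path can be tested without any model, and the judging model can be swapped without touching the scoring logic.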
Abstract
Mathematics has long been conveyed through natural language, primarily for human understanding. With the rise of mechanized mathematics and proof assistants, there is a growing need to understand informal mathematical text, yet most existing benchmarks focus solely on English, overlooking other languages. This paper introduces RoMath, a Romanian mathematical reasoning benchmark suite comprising three subsets: Baccalaureate, Competitions, and Synthetic. Together they cover a range of mathematical domains and difficulty levels, with the aim of improving non-English language models and promoting multilingual AI development. By focusing on Romanian, a low-resource language with unique linguistic features, RoMath addresses the limitations of Anglo-centric models and emphasizes the need for dedicated resources beyond simple automatic translation. We benchmark several open-weight language models, highlighting the importance of creating resources for underrepresented languages. Code and datasets are made available.
