RoMath: A Mathematical Reasoning Benchmark in Romanian

Adrian Cosma; Ana-Maria Bucur; Emilian Radoi

TL;DR

RoMath addresses the lack of Romanian mathematical reasoning benchmarks by introducing a three-subset suite (Baccalaureate, Competitions, Synthetic) totaling $76{,}910$ problems and a semi-automatic pipeline that extracts, structures, and annotates Romanian math content for evaluation. The authors propose an evaluation framework that combines exact-output checks for verifiable problems with LLM-based judging for proofs. They benchmark open-weight LLMs, apply LoRA fine-tuning and GRPO-style rewards, and examine the challenges of translating Romanian math statements to English. Key findings are that some English-centric models can output Romanian math solutions, that translation degrades performance, and that training on verifiable data with rewards can boost results. Limitations include the reliance on judge-based evaluation and translation artifacts. The work provides a reproducible resource with open code and data, offering a baseline for future multilingual and low-resource-language mathematical reasoning research and highlighting the need for language-specific resources beyond simple translation.
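The two-track evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `normalize` and `score_solution`, the normalization rules, and the `judge` callable (a stand-in for an LLM-based judge) are all assumptions made for illustration.

```python
from typing import Callable, Optional


def normalize(answer: str) -> str:
    """Hypothetical normalization: drop a trailing period, outer $ delimiters, and all whitespace."""
    text = answer.strip()
    if text.endswith("."):
        text = text[:-1]
    text = text.strip().strip("$")
    return "".join(text.split()).lower()


def score_solution(
    predicted: str,
    reference: str,
    is_proof: bool,
    judge: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Return 1.0 if the solution is accepted and 0.0 otherwise.

    Verifiable problems are scored by exact match after normalization.
    Proof problems are delegated to `judge`, an assumed callable that
    wraps an LLM and returns True when it accepts the proof.
    """
    if is_proof:
        if judge is None:
            raise ValueError("proof problems need a judge callable")
        return 1.0 if judge(predicted, reference) else 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

Under the same assumptions, the binary score on verifiable problems could also serve as a simple reward signal for GRPO-style training, since it needs no judge model.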

Abstract

Mathematics has long been conveyed through natural language, primarily for human understanding. With the rise of mechanized mathematics and proof assistants, there is a growing need to understand informal mathematical text, yet most existing benchmarks focus solely on English, overlooking other languages. This paper introduces RoMath, a Romanian mathematical reasoning benchmark suite comprising three subsets: Baccalaureate, Competitions and Synthetic, which cover a range of mathematical domains and difficulty levels, aiming to improve non-English language models and promote multilingual AI development. By focusing on Romanian, a low-resource language with unique linguistic features, RoMath addresses the limitations of Anglo-centric models and emphasizes the need for dedicated resources beyond simple automatic translation. We benchmark several open-weight language models, highlighting the importance of creating resources for underrepresented languages. Code and datasets are made available.

Paper Structure (14 sections, 4 equations, 5 figures, 11 tables)

Introduction
Related Work
Method
Dataset Construction
RoMath Suite
Evaluation Procedure
Baselines and Results
Judge Evaluation
Model Benchmark
Training with Verifiable Rewards
Impact of the Judge Model
Translating Romanian Problems to English
Conclusions and Future Directions
Appendix

Figures (5)

Figure 1: Overall diagram of our approach to curating problems from existing PDFs. We employ MathPix to OCR PDFs and obtain markdown with LaTeX formatting for mathematical statements. We further process the markdown using proprietary LLMs to split it into sub-problems, associate problems with the appropriate solution, and annotate each problem with metadata.
Figure 2: Distribution of the number of problems per domain for Baccalaureate, Competitions and Synthetic.
Figure 3: Performance of Romanian models and math-specialized models on each domain from each RoMath subset.
Figure 4: Performance of GRPO-trained Llama-3.2 and Qwen2 on a subset of Baccalaureate that has verifiable answers.
Figure 5: Performance using different judge models.
