MMATH: A Multilingual Benchmark for Mathematical Reasoning

Wenyang Luo, Wayne Xin Zhao, Jing Sha, Shijin Wang, Ji-Rong Wen

TL;DR

MMATH introduces a multilingual benchmark for complex mathematical reasoning, spanning 374 problems across 10 languages to study off-target language generation in LLMs. It presents a three-stage translation pipeline and two core metrics, Answer Accuracy and Language Consistency Ratio (LCR), to quantify multilingual reasoning performance. Across models, the study highlights persistent language gaps and demonstrates that prompting strategies and English-centric training can improve both accuracy and language consistency, with EN-Think achieving strong averages (e.g., 66.72) and high LCR (97.61). The findings offer practical insights for enhancing multilingual reasoning in LLMs and establish a framework for future multilingual complex reasoning research.
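The two metrics named above can be illustrated with a minimal sketch. It assumes that Answer Accuracy is the percentage of responses whose final answer matches the reference, and that LCR is the percentage of responses whose language matches the language of the question. These definitions, the `Response` record, and both function names are assumptions made for illustration, not the authors' implementation. Language detection is left out of scope, so `detected_lang` is taken as given.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Response:
    """One model output. All field names are hypothetical, for illustration only."""
    target_lang: str    # language the question was asked in, e.g. "fr"
    detected_lang: str  # language label assigned to the output by some detector
    predicted: str      # final answer extracted from the output
    gold: str           # reference answer


def answer_accuracy(responses: Sequence[Response]) -> float:
    """Percentage of responses whose extracted answer equals the reference (assumed definition)."""
    if not responses:
        return 0.0
    correct = sum(r.predicted.strip() == r.gold.strip() for r in responses)
    return 100.0 * correct / len(responses)


def language_consistency_ratio(responses: Sequence[Response]) -> float:
    """Percentage of responses written in the target language (assumed definition)."""
    if not responses:
        return 0.0
    consistent = sum(r.detected_lang == r.target_lang for r in responses)
    return 100.0 * consistent / len(responses)


# Toy data, not taken from the benchmark.
sample = [
    Response("fr", "fr", "42", "42"),
    Response("fr", "en", "42", "42"),   # correct answer, but off-target language
    Response("ja", "ja", "7", "8"),
    Response("ja", "ja", "3", "5"),
]
acc = answer_accuracy(sample)
lcr = language_consistency_ratio(sample)
```

Keeping the two scores separate makes the off-target case visible: the second toy response counts toward accuracy but lowers LCR.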

Abstract

The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in target languages can simultaneously enhance performance and preserve target-language consistency. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models. Our code and data are available at https://github.com/RUCAIBox/MMATH.

Paper Structure

This paper contains 31 sections, 10 figures, 14 tables.

Figures (10)

  • Figure 1: A demonstration of off-target generation. The text with a blue background shows a French question, while the red text represents LLMs' English thinking and response, highlighting a language inconsistency.
  • Figure 2: The percentage of thinking done in each language. The vertical axis is the source language and the horizontal axis is the target language. Because the responses of Qwen2.5-32B-Instruct do not contain <think>, the language of the whole response is used instead.
  • Figure 3: The percentage of answers given in each language.
  • Figure 4: The demonstration of our benchmark construction process.
  • Figure 5: Multilingual native language prompts for different languages.
  • ...and 5 more figures