Could Thinking Multilingually Empower LLM Reasoning?

Changjiang Gao, Xu Huang, Wenhao Zhu, Shujian Huang, Lei Li, Fei Yuan

TL;DR

The paper investigates the upper bound of multilingual reasoning in large language models, challenging the dominant English bias by showing that aggregating reasoning across multiple languages can substantially improve performance on GPQA and MGSM tasks. Through systematic experiments across multiple models and languages, it demonstrates a high potential gain in Acc@$k$, robust to translation quality, and identifies key factors such as language mixing and advantage languages that drive gains. However, common answer-selection strategies (majority voting, prompt-based guidance, and LLM-based judging) struggle to reliably realize this upper bound due to biases and reliance on optimal language combinations. The work highlights both the promise and the practical challenges of multilingual reasoning, providing a foundation for future research to devise robust, language-agnostic strategies that consistently harness multilingual information for improved reasoning in LLMs.

Abstract

Previous work indicates that large language models exhibit a significant "English bias", i.e. they often perform better when tasks are presented in English. Interestingly, we have observed that using certain other languages in reasoning tasks can yield better performance than English. However, this phenomenon remains under-explored. In this paper, we explore the upper bound of harnessing multilingualism in reasoning tasks, suggesting that multilingual reasoning promises significantly (by nearly 10 Acc@$k$ points) and robustly (tolerance for variations in translation quality and language choice) higher upper bounds than English-only reasoning. Besides analyzing the reason behind the upper bound and challenges in reaching it, we also find that common answer selection methods cannot achieve this upper bound, due to their limitations and biases. These insights could pave the way for future research aimed at fully harnessing the potential of multilingual reasoning in LLMs.
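
The gap between the Acc@$k$ upper bound and what answer selection actually achieves can be illustrated with a minimal Python sketch. It assumes that Acc@$k$ counts a question as solved when at least one of its $k$ candidate answers (for example, one per language) matches the gold answer; this excerpt does not spell out the definition, so treat that reading as an assumption. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter

def acc_at_k(candidates, gold):
    """Fraction of questions where ANY of the k candidates is correct
    (assumed definition of Acc@k, i.e. an oracle upper bound)."""
    hits = sum(1 for answers, g in zip(candidates, gold) if g in answers)
    return hits / len(gold)

def majority_vote_acc(candidates, gold):
    """Fraction of questions where the most frequent candidate is correct."""
    correct = 0
    for answers, g in zip(candidates, gold):
        winner, _count = Counter(answers).most_common(1)[0]
        correct += int(winner == g)
    return correct / len(gold)

# Hypothetical toy data: 4 questions, 3 candidate answers each
# (think of one answer per reasoning language).
answers = [
    ["A", "A", "B"],
    ["C", "D", "D"],
    ["B", "B", "A"],
    ["D", "D", "D"],
]
gold = ["A", "C", "A", "B"]

upper_bound = acc_at_k(answers, gold)        # oracle selection
voted = majority_vote_acc(answers, gold)     # practical selection
print(upper_bound, voted)
```

In this toy case the correct answer appears among the candidates for three of the four questions, yet voting recovers only one of them, which mirrors the kind of shortfall the abstract attributes to common answer selection methods.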

Paper Structure

This paper contains 36 sections, 8 figures, 16 tables.

Figures (8)

  • Figure 1: English is not always better than other languages. Evaluation results on the human-translated GPQA (Rein et al., 2023) and MGSM (Shi et al., 2023) datasets (obtained from Huang et al., 2025). The red cells indicate greater-than-English scores.
  • Figure 2: An introduction to input samples across various comparison methods, including Multilingual, Repeat, Paraphrase, Repeat-Mix, and Paraphrase-Mix.
  • Figure 3: Compared to Repeat and Paraphrase, Multilingual demonstrates a higher performance upper bound. Acc@17 scores of Multilingual, Paraphrase and Repeat settings of the three models on the human-translated GPQA dataset.
  • Figure 4: Multilingual surpasses Paraphrase and Repeat in Acc@$k$ beyond $k=3$, by a growing margin. Best Acc@$k$ (out of 17) of the Multilingual, Paraphrase and Repeat settings for Qwen2.5-72B with increasing numbers of languages or candidates on the human-translated GPQA dataset.
  • Figure 5: Fully utilizing non-English languages can improve the upper bound. Distribution of Acc@4 scores of all possible 4-candidate-combinations with Qwen2.5-72B on the human-translated GPQA dataset, under different settings.
  • ...and 3 more figures
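
The captions for Figures 4 and 5 refer to the best Acc@$k$ and to the distribution of Acc@4 over all candidate combinations. A rough sketch of how such scores could be enumerated follows, again assuming that Acc@$k$ means "at least one of the $k$ candidates is correct". The language codes, the per-language correctness flags, and the helper names are all hypothetical, and the example uses 3 languages with $k=2$ rather than the 17 candidates of the paper.

```python
from itertools import combinations

def acc_at_k(correct_by_lang, langs):
    """Share of questions answered correctly by at least one language in `langs`."""
    n_questions = len(next(iter(correct_by_lang.values())))
    solved = sum(
        any(correct_by_lang[lang][i] for lang in langs)
        for i in range(n_questions)
    )
    return solved / n_questions

def combination_scores(correct_by_lang, k):
    """Acc@k for every k-sized combination of the available languages."""
    return {
        combo: acc_at_k(correct_by_lang, combo)
        for combo in combinations(sorted(correct_by_lang), k)
    }

# Hypothetical per-question correctness for three reasoning languages.
correct = {
    "en": [True, True, False, False],
    "ja": [True, False, True, False],
    "sw": [False, False, True, True],
}

scores = combination_scores(correct, 2)   # the full distribution (cf. Figure 5)
best = max(scores, key=scores.get)        # the best combination (cf. Figure 4)
print(best, scores[best])
```

Exhaustive enumeration is only practical for small pools: choosing 4 of 17 candidates already gives 2380 combinations, which is still cheap, but the count grows quickly with larger $k$.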