Could Thinking Multilingually Empower LLM Reasoning?
Changjiang Gao, Xu Huang, Wenhao Zhu, Shujian Huang, Lei Li, Fei Yuan
TL;DR
The paper investigates the upper bound of multilingual reasoning in large language models. It challenges the dominant English bias by showing that aggregating reasoning across multiple languages can substantially improve performance on the GPQA and MGSM benchmarks. Systematic experiments across multiple models and languages show a high potential gain in Acc@$k$ that is robust to translation quality, and identify key factors that drive the gains, such as language mixing and advantage languages. However, common answer-selection strategies (majority voting, prompt-based guidance, and LLM-based judging) struggle to reliably realize this upper bound because of their biases and their reliance on optimal language combinations. The work highlights both the promise and the practical challenges of multilingual reasoning, providing a foundation for future research on robust, language-agnostic strategies that consistently harness multilingual information to improve reasoning in LLMs.
Abstract
Previous work indicates that large language models exhibit a significant "English bias", i.e., they often perform better when tasks are presented in English. Interestingly, we have observed that using certain other languages in reasoning tasks can yield better performance than English. However, this phenomenon remains under-explored. In this paper, we explore the upper bound of harnessing multilingualism in reasoning tasks, suggesting that multilingual reasoning promises significantly (by nearly 10 Acc@$k$ points) and robustly (tolerating variations in translation quality and language choice) higher upper bounds than English-only reasoning. Besides analyzing the reasons behind the upper bound and the challenges in reaching it, we also find that common answer selection methods cannot achieve this upper bound, due to their limitations and biases. These insights could pave the way for future research aimed at fully harnessing the potential of multilingual reasoning in LLMs.
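
To make the gap between the upper bound and practical answer selection concrete, the following is a minimal sketch. It assumes Acc@$k$ counts a question as solved when at least one of its $k$ candidate answers (here, one per language) is correct; this definition is an assumption, since the abstract does not spell it out. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def acc_at_k(answers_per_question, gold):
    """Assumed Acc@k: share of questions where any of the k candidates is correct."""
    hits = sum(
        any(answer == target for answer in answers)
        for answers, target in zip(answers_per_question, gold)
    )
    return hits / len(gold)


def majority_vote_acc(answers_per_question, gold):
    """Accuracy when the most frequent candidate is chosen as the final answer."""
    hits = 0
    for answers, target in zip(answers_per_question, gold):
        # Ties are broken in favor of the answer that appears first.
        winner, _ = Counter(answers).most_common(1)[0]
        hits += winner == target
    return hits / len(gold)


# Hypothetical toy data: each row holds one question's answers from three
# languages (for example English, Spanish, Japanese), in that order.
answers = [
    ["A", "A", "B"],
    ["C", "D", "D"],
    ["B", "C", "A"],
    ["D", "D", "D"],
]
gold = ["A", "C", "A", "B"]

upper_bound = acc_at_k(answers, gold)          # 0.75
voted = majority_vote_acc(answers, gold)       # 0.25
english_only = acc_at_k([[row[0]] for row in answers], gold)  # 0.5
```

In this toy case the correct answer appears somewhere among the three languages for three of the four questions, yet voting recovers only one of them, which mirrors the kind of gap between the Acc@$k$ upper bound and majority voting that the abstract describes.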
