Do Multilingual Language Models Think Better in English?

Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, Mikel Artetxe

TL;DR

This work probes whether translate-test improvements for multilingual language models arise from external translation resources or from the models' inherent multilingual capabilities. It introduces self-translate, a prompting-based approach that uses the model to translate inputs into English before solving tasks in English, thereby removing the need for an external MT system. Across five tasks and multiple model families, self-translate consistently outperforms direct non-English prompting, with larger gains for high-resource languages and bigger models; while external MT can still beat self-translate, the gap narrows as model scale increases. The study highlights a fundamental limitation in exploiting multilingual potential through prompting alone and suggests that scaling and instruction-tuning may further reduce reliance on intermediate translation steps, enhancing practical multilingual reasoning capabilities.

Abstract

Translate-test is a popular technique to improve the performance of multilingual language models. This approach works by translating the input into English using an external machine translation system, and running inference over the translated input. However, these improvements can be attributed to the use of a separate translation system, which is typically trained on large amounts of parallel data not seen by the language model. In this work, we introduce a new approach called self-translate, which removes the need for an external translation system by leveraging the few-shot translation capabilities of multilingual language models. Experiments over 5 tasks show that self-translate consistently outperforms direct inference, demonstrating that language models are unable to leverage their full multilingual potential when prompted in non-English languages. Our code is available at https://github.com/juletx/self-translate.

Paper Structure (21 sections, 3 figures, 21 tables)

This paper contains 21 sections, 3 figures, 21 tables.

Figures (3)

  • Figure 1: XGLM results (average accuracy). We show that self-translate (using the model itself to translate the input into English) works better than using the original input in the non-English language.
  • Figure 2: Direct inference (top) vs. self-translate (bottom). In direct inference (standard) the task is solved by prompting the model in the original language. In self-translate (proposed), we first translate the input into English by prompting the same model, and then solve the task in English.
  • Figure 3: Downstream (top) and MT (bottom) performance, grouped by low-resource (left) and high-resource (right) languages. For downstream, we report average accuracy over XStoryCloze, XCOPA and XNLI, which have the most language variety. Low- and high-resource languages follow Lin et al. (2022), merging the low and ex-low categories. For MT, we report COMET (Rei et al., 2022), using the target-language text for each field in those datasets as the source, and the English text as the reference.
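
The two pipelines contrasted in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical stand-in for any function that maps a prompt to the model's continuation, the "Language: text / English: text" few-shot layout is an assumed prompt format, and the task step uses free-form generation for simplicity, whereas a real evaluation may instead score candidate answers.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface (an assumption for this sketch): any function that
# maps a prompt string to the language model's text continuation.
Generate = Callable[[str], str]


def build_translation_prompt(
    examples: Sequence[Tuple[str, str]], source_lang: str, text: str
) -> str:
    """Build a few-shot prompt asking the model to translate `text` into English.

    The "<language>: ... / English: ..." layout is an assumed format.
    """
    shots = [f"{source_lang}: {src}\nEnglish: {tgt}" for src, tgt in examples]
    shots.append(f"{source_lang}: {text}\nEnglish:")
    return "\n\n".join(shots)


def direct_inference(generate: Generate, task_prompt: str, text: str) -> str:
    """Direct inference: solve the task on the original, non-English input."""
    return generate(task_prompt.format(input=text)).strip()


def self_translate(
    generate: Generate,
    examples: Sequence[Tuple[str, str]],
    source_lang: str,
    task_prompt: str,
    text: str,
) -> str:
    """Self-translate: the same model translates first, then solves in English."""
    translation_prompt = build_translation_prompt(examples, source_lang, text)
    # Keep only the first line of the continuation, in case the model goes on
    # to generate further few-shot examples after the translation.
    english = generate(translation_prompt).strip().split("\n")[0]
    return generate(task_prompt.format(input=english)).strip()
```

Both functions call the same `generate`, which is the point of the comparison: any difference between the two outputs comes from the language of the task prompt's input, not from an external translation system. The `task_prompt` template is expected to contain an `{input}` placeholder.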