Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models

Chaoqun Liu, Wenxuan Zhang, Yiran Zhao, Anh Tuan Luu, Lidong Bing

TL;DR

This work critically reevaluates multilingual prompting for large language models beyond English-centric setups. It shows that translating prompts to English can boost NLP task performance for English-centric models, but prompting in native languages often yields better results on culture-sensitive or real-world queries, especially for non-English-centric systems. By assessing both NLP benchmarks and real user queries, the paper reveals diverse, task-dependent behaviors across models and languages, and highlights the limitations of translation-based approaches. The findings advocate for broader multilingual evaluation and the development of genuinely multilingual models to capture language-specific nuances and culture-specific knowledge.

Abstract

Large language models (LLMs) have demonstrated multilingual capabilities, yet they are mostly English-centric due to the imbalanced training corpora. While prior works have leveraged this bias to enhance multilingual performance through translation, they have been largely limited to natural language processing (NLP) tasks. In this work, we extend the evaluation to real-world user queries and non-English-centric LLMs, offering a broader examination of multilingual performance. Our key contribution lies in demonstrating that while translation into English can boost the performance of English-centric LLMs on NLP tasks, it is not universally optimal. For culture-related tasks that need deep language understanding, prompting in the native language proves more effective as it better captures the nuances of culture and language. Our experiments expose varied behaviors across LLMs and tasks in the multilingual context, underscoring the need for a more comprehensive approach to multilingual evaluation. Therefore, we call for greater efforts in developing and evaluating LLMs that go beyond English-centric paradigms.
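
The two prompting strategies contrasted above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the prompt template, and the `translate_to_english` callable are hypothetical, and the selection rule in `choose_prompt` is a simplification of the abstract's finding (translate for English-centric models on non-cultural NLP tasks, otherwise prompt natively).

```python
from typing import Callable


def native_prompt(question: str) -> str:
    """Build a prompt that keeps the question in its original language."""
    return f"Question: {question}\nAnswer:"


def translate_then_prompt(
    question: str, translate_to_english: Callable[[str], str]
) -> str:
    """Translate the question into English first, then build the prompt.

    `translate_to_english` is a hypothetical stand-in for any translation
    system (an external MT service or the LLM itself).
    """
    return f"Question: {translate_to_english(question)}\nAnswer:"


def choose_prompt(
    question: str,
    culture_sensitive: bool,
    english_centric_model: bool,
    translate_to_english: Callable[[str], str],
) -> str:
    """Pick a strategy using a simplified rule taken from the abstract.

    Assumption: translation is used only for English-centric models on
    tasks that are not culture-sensitive; everything else stays native.
    """
    if english_centric_model and not culture_sensitive:
        return translate_then_prompt(question, translate_to_english)
    return native_prompt(question)
```

In practice the boundary between culture-sensitive and language-independent tasks is not a boolean flag; the paper reports behavior that varies across models, tasks, and languages, so a rule like this would need to be validated per setting.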

Paper Structure (45 sections, 12 figures, 14 tables)

Figures (12)

  • Figure 1: Illustration of two types of LLMs on tasks with varying language dependencies. "English-centric LLMs" refers to LLMs trained mainly on English corpora. "Multilingual LLMs" refers to ideal LLMs equally capable in all languages.
  • Figure 2: Examples illustrating how translation can both improve (a) and degrade (b) the performance of LLMs. The Chinese example is from MGSM (Shi et al., 2022) and the Swahili example is from M3Exam (Zhang et al., 2023). Translation is beneficial when the questions are semantically equivalent across languages. However, for questions that demand deep cultural knowledge, translation can hinder the ability to answer accurately.
  • Figure 3: BLEU scores for translating MGSM questions with different translation systems.
  • Figure 4: Correlations between translation BLEU scores and MGSM accuracy for the three prompting techniques: Trans-Google, Trans-NLLB, and Self-Translate. Each dot in the figure represents the performance of one model on one language.
  • Figure 5: Win rate comparison for each language using ChatGPT and Llama-2-70B-Chat.
  • ...and 7 more figures