
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models

Lynn Chua, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, Chulin Xie, Chiyuan Zhang

TL;DR

The paper investigates whether multilingual LLMs can transfer knowledge across languages and finds a notable crosslingual knowledge barrier: models perform well on explicit crosslingual tasks like translation but struggle to apply learned knowledge when questions are posed in a different language. The authors demonstrate this via general-knowledge (MMLU) and domain-specific (Harry Potter and TOFU) evaluations, revealing substantial gaps in crosslingual QA. They show that inference-time mitigation offers limited relief, whereas mixed-language fine-tuning on general and domain-specific corpora significantly reduces the barrier, improves crosslingual QA, and benefits out-of-distribution languages. These findings underscore the need for explicit optimization to unlock full crosslingual potential, with practical implications for multilingual AI assistants and cross-language knowledge retrieval; the authors also provide public code to support further research.

Abstract

Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks. We observe that while these models show promising surface-level crosslingual abilities on machine translation and embedding space analyses, they struggle with deeper crosslingual knowledge transfer, revealing a crosslingual knowledge barrier in both general (MMLU benchmark) and domain-specific (Harry Potter quiz and TOFU benchmark) contexts. Since simple inference-time mitigation methods offer only limited improvement, we propose fine-tuning of LLMs on mixed-language data, which effectively reduces these gaps, even when using out-of-domain datasets like WikiText. Our findings suggest the need for explicit optimization to unlock the full crosslingual potential of LLMs. Our code is publicly available at https://github.com/google-research/crosslingual-knowledge-barriers.
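
To make the mixed-language fine-tuning idea from the abstract concrete, the snippet below is a minimal sketch, not the authors' released pipeline (see the GitHub link above for that): it takes English sentences from a corpus such as WikiText and translates a random subset into other languages, so that each training document mixes languages. The translate helper and the language list are hypothetical placeholders standing in for a real machine-translation system and the languages actually used.

    import random

    # Minimal sketch of mixed-language fine-tuning data construction.
    # NOT the authors' released code; `translate` is a hypothetical
    # placeholder for a real machine-translation call.

    TARGET_LANGS = ["fr", "de", "es", "it"]  # example target languages


    def translate(sentence: str, lang: str) -> str:
        """Placeholder: tag the sentence instead of actually translating it."""
        return f"<{lang}> {sentence}"


    def mix_languages(document: str, p: float = 0.5, seed: int = 0) -> str:
        """Translate each sentence of an English document into a randomly
        chosen target language with probability p, keeping the rest in English."""
        rng = random.Random(seed)
        sentences = document.split(". ")
        mixed = [
            translate(s, rng.choice(TARGET_LANGS)) if rng.random() < p else s
            for s in sentences
        ]
        return ". ".join(mixed)


    if __name__ == "__main__":
        doc = "Harry lived with the Dursleys. He later attended Hogwarts. The school has four houses."
        print(mix_languages(doc))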

Paper Structure

This paper contains 38 sections, 18 figures, and 8 tables.

Figures (18)

  • Figure 1: While multilingual LLMs show promising crosslingual abilities on explicit tasks like machine translation where the source text is provided in the context, they struggle to bridge the language gap on knowledge-intensive tasks that require implicit crosslingual correlation of parametric knowledge, revealing a crosslingual knowledge barrier. Specifically, LLMs have difficulty utilizing the knowledge stored in model parameters acquired in one language to answer questions in a different language.
  • Figure 2: Embeddings of En text and mixed-language-translated text are more closely aligned than baselines. The ellipses represent the covariance confidence intervals.
  • Figure 3: (a) presents examples of the original, fully-translated, and proposed mixed-language multiple-choice question (MCQ) formats (a construction sketch follows this list). (b) shows the monolingual evaluation under 5 languages, where all 15 LLMs perform better at answering MMLU MCQs in English; detailed results for the four MMLU domains (STEM, Social Science, Humanities, Others) are in the appendix figure fig:monolingual_mmlu_barrier_more. (c) shows the results under crosslingual settings, where * denotes the average accuracy across {fr, de, es, it}. LLMs perform worse at answering MCQs in mixed-language settings than in English, especially under the GT-option and Mixup translations, indicating the existence of a crosslingual knowledge barrier. (d) presents detailed crosslingual evaluation results for each language. We observe similar findings for all 15 LLMs in the appendix figure fig:mixup_mmlu_barrier_more.
  • Figure 4: Crosslingual knowledge barriers across 16 languages under mixed-language MCQ evaluation on MMLU.
  • Figure 5: Models consistently perform best at answering questions in English, both before and after fine-tuning, indicating the presence of a crosslingual knowledge barrier for domain-specific Harry Potter knowledge (a) and TOFU knowledge (b).
  • ...and 13 more figures
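
For concreteness, here is a minimal sketch (not the authors' code) of how the two mixed-language MCQ variants named in the Figure 3 caption, GT-option translation and Mixup translation, could be constructed from an English MMLU item. The translate helper is again a hypothetical stand-in for a machine-translation system.

    import random

    # Sketch of the two mixed-language MCQ variants described in Figure 3.
    # Not the authors' code; `translate` is a hypothetical MT placeholder.


    def translate(text: str, lang: str) -> str:
        """Placeholder: tag the text instead of actually translating it."""
        return f"<{lang}> {text}"


    def gt_option_translation(question, options, gt_index, lang):
        """Keep the question and the distractors in English; translate only
        the ground-truth option into the target language."""
        mixed_options = list(options)
        mixed_options[gt_index] = translate(options[gt_index], lang)
        return question, mixed_options


    def mixup_translation(question, options, lang, p=0.5, seed=0):
        """Independently translate the question and each option into the
        target language with probability p, yielding a randomly mixed MCQ."""
        rng = random.Random(seed)
        q = translate(question, lang) if rng.random() < p else question
        opts = [translate(o, lang) if rng.random() < p else o for o in options]
        return q, opts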