The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure

Niyati Bafna, Tianjian Li, Kenton Murray, David R. Mortensen, David Yarowsky, Hale Sirin, Daniel Khashabi

TL;DR

This work formalizes and quantifies the Translation Barrier Hypothesis: in multilingual generation with large language models, a task-solving stage operating on largely language-agnostic representations is followed by a translation stage that adapts outputs to the target language. Using a word-translation task across 108 language pairs and two 8B decoder models, the authors define a framework that combines intermediate-layer analysis (via logit lens), translation loss $\mathit{TL}$, and translation barrier proportion $\mathit{TLP}$ to apportion final errors between task-solving and translation. They show that translation failure accounts for a dominant share of final errors for many language pairs, especially those with low-resource targets, and that task-solving remains relatively language-agnostic in middle layers but is entangled with language for some targets. Case studies on model scale and on arithmetic tasks suggest that the translation barrier persists across model sizes and tasks, implying a need for modular MT-LLM strategies or for improvements in late-stage translation. The findings inform design choices in multilingual LLM systems and indicate where future work should focus to close the gap for low-resource languages.

Abstract

Multilingual generation with large language models (LLMs) is often of poor quality for mid- to low-resource languages, but the causes for this are not well understood. We first demonstrate the existence of an implicit task-solving $\rightarrow$ translation pipeline for generation, whereby the model first solves the required task in a largely target-language-agnostic manner, and subsequently translates answer concepts into the intended target language. We hypothesize that the failure of the translation stage, despite task-solving success, is an important culprit for the observed low quality of final outputs, and formalize this as the translation barrier hypothesis. We quantify the extent to which each stage in the pipeline is responsible for final failure for a word translation task across 108 language pairs, and find that the translation barrier explains a dominant portion of error for a majority of language pairs, and is especially severe for low-resource target languages. Our results highlight an important bottleneck for end-to-end multilingual generation, relevant for future work seeking to improve multilinguality in LLMs.
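The apportioning of final errors between the two stages can be illustrated with a minimal Python sketch. The definitions used here are assumptions made for illustration, since the paper's equations are not reproduced in this section: translation loss is taken to be the share of all examples that are correct in some intermediate layer but wrong in the final output, and the translation barrier proportion is that same count taken over final failures only. The `Example` class and `translation_metrics` function are hypothetical names, not the authors' code.

```python
from dataclasses import dataclass


@dataclass
class Example:
    # True if the correct answer concept appears in some intermediate layer,
    # in any language (assumed to be determined upstream, e.g. via logit lens).
    intermediate_correct: bool
    # True if the final output is correct in the intended target language.
    final_correct: bool


def translation_metrics(examples):
    """Apportion final failures between task-solving and translation.

    ASSUMPTION: TL and TLP are defined as described in the lead-in;
    the paper's exact formulas may differ.
    """
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one example")

    final_failures = [e for e in examples if not e.final_correct]
    # Solved in an intermediate layer, but lost before the final output.
    translation_failures = [e for e in final_failures if e.intermediate_correct]

    return {
        "intermediate_acc": sum(e.intermediate_correct for e in examples) / n,
        "final_acc": sum(e.final_correct for e in examples) / n,
        "tl": len(translation_failures) / n,
        "tlp": (len(translation_failures) / len(final_failures)
                if final_failures else 0.0),
    }


if __name__ == "__main__":
    toy = [
        Example(True, True),    # solved and translated
        Example(True, False),   # solved, translation failed
        Example(True, False),   # solved, translation failed
        Example(False, False),  # task-solving failed
    ]
    print(translation_metrics(toy))
```

Under these assumed definitions, a high value of `tlp` means that most final errors occur even though the model had already found the right answer internally, which is the pattern the hypothesis describes.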

Paper Structure

This paper contains 47 sections, 5 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Task-solving succeeds, with the correct answer concept (i.e., cat) discovered in intermediate layers and expressed in various high-resource languages (HRLs), but the model fails to realize or translate the concept into the target low-resource language (LRL).
  • Figure 2: Each plot shows the percentage of on-target correct and incorrect outputs in the top half, and off-target correct and incorrect outputs in the bottom half, for the last $10$ layers of aya-23 and llama-3.1, over all outputs with a reliable LID tag. * supported language. The task-solving stage appears first, with a high share of accurate but off-target outputs; it is followed by the translation stage, where the model transitions to on-target outputs (correct and incorrect). The translation stage succeeds for French and Indonesian (HRLs), which reach high final on-target accuracy, but fails for Marathi (LRL). For a low-resource source language like Telugu, task-solving itself may fail, as with aya-23, which shows low off-target accuracy.
  • Figure 3: Percentage of accurate layer outputs that are in the target language, for $5$ target languages, averaged over source languages. * supported language. This stays low for middle layers, indicating that accurate answers are largely off-target, and increases in final layers, indicating translation to the target language.
  • Figure 4: Distribution over task-solving languages, i.e. the languages of correct intermediate layer outputs, aggregated over all source-target language pairs for aya-23 (top) and llama-3.1 (bottom). We show the top $15$ task-solving languages, covering $85\%$ and $97\%$ of probability mass of the distribution for aya-23 and llama-3.1 respectively. * supported language.
  • Figure 5: Intermediate accuracy, final accuracy, and $\mathit{TLP}$ (defined in the paper's $\mathit{TLP}$ equation) for aya-23 and llama-3.1, sorted in ascending order of mean $\mathit{TLP}$ over source languages, for $18$ target languages selected to cover the range of mean $\mathit{TLP}$. * supported language. While intermediate accuracy is high even for LRLs (Cebuano-ceb, Nepali-nep), final accuracy is high for supported HRLs (Portuguese-por, German-deu) but drops considerably for LRLs. $\mathit{TLP}$ is high for most target languages, and especially for low-resource languages. An expanded figure including all target languages appears in the paper's full-results section.
  • ...and 10 more figures
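
The four-way breakdown described in the Figure 2 caption (on-target or off-target, correct or incorrect) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: each layer output is assumed to arrive already tagged with a language-ID label (`None` when the tag is unreliable) and a correctness flag, and the function names and language codes are hypothetical.

```python
from collections import Counter

CATEGORIES = (
    "on-target correct",
    "on-target incorrect",
    "off-target correct",
    "off-target incorrect",
)


def categorize(pred_lang, target_lang, is_correct):
    """Assign one layer output to one of the four categories."""
    side = "on-target" if pred_lang == target_lang else "off-target"
    quality = "correct" if is_correct else "incorrect"
    return f"{side} {quality}"


def layer_breakdown(outputs, target_lang):
    """Percentage of each category among outputs with a reliable LID tag.

    outputs: list of (pred_lang, is_correct) pairs for a single layer,
             where pred_lang is None if the LID tag is unreliable
             (ASSUMPTION: tagging and correctness are computed upstream).
    """
    reliable = [(lang, ok) for lang, ok in outputs if lang is not None]
    total = len(reliable)
    if total == 0:
        return {category: 0.0 for category in CATEGORIES}
    counts = Counter(categorize(lang, target_lang, ok) for lang, ok in reliable)
    return {category: 100.0 * counts[category] / total for category in CATEGORIES}


if __name__ == "__main__":
    # Toy outputs for one intermediate layer with Marathi ("mar") as target.
    layer_outputs = [
        ("eng", True),   # right concept, wrong language
        ("mar", True),   # right concept, right language
        ("mar", False),  # wrong concept, right language
        (None, True),    # unreliable LID tag, excluded
        ("fra", False),  # wrong concept, wrong language
    ]
    print(layer_breakdown(layer_outputs, "mar"))
```

Running this per layer and plotting the four percentages against layer index would give a view similar in spirit to the one the caption describes, with off-target correct outputs expected to dominate before the final layers.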