Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs

Kartik Ravisankar, Hyojung Han, Marine Carpuat

TL;DR

The paper investigates cross-lingual representation alignment in decoder-only LLMs and how it relates to multilingual performance. It introduces per-sample metrics, the Discriminative Alignment Index (DALI, with a stricter variant DALI_S) and a task-specific MEXA_T, and evaluates them with Llama3.1 8B on Belebele, XStoryCloze, XCOPA, and FLORES-based translation tasks. At the language level, alignment to English correlates strongly with task accuracy. At the instance level, alignment is predictive only for some tasks (notably Belebele) and for several translation directions, which suggests that alignment is a necessary but not sufficient condition for success. The work also highlights asymmetries between En→XX and XX→En translation, points to confounding factors and the limits of English-centric alignment, and suggests that more nuanced analyses of confidence and calibration are needed for robust multilingual reasoning.

Abstract

Large language models (LLMs) pre-trained predominantly on English text exhibit surprising multilingual capabilities, yet the mechanisms driving cross-lingual generalization remain poorly understood. This work investigates how the alignment of representations for text written in different languages correlates with LLM performance on natural language understanding tasks and translation tasks, both at the language and the instance level. For this purpose, we introduce cross-lingual alignment metrics such as the Discriminative Alignment Index (DALI) to quantify the alignment at an instance level for discriminative tasks. Through experiments on three natural language understanding tasks (Belebele, XStoryCloze, XCOPA), and machine translation, we find that while cross-lingual alignment metrics strongly correlate with task accuracy at the language level, the sample-level alignment often fails to distinguish correct from incorrect predictions, exposing alignment as a necessary but insufficient condition for success.

Paper Structure

This paper contains 43 sections, 2 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: DALI, a novel cross-lingual alignment measure, is calculated per sample of a discriminative task, across transformer layers, from the model's representations. In the example, the task is to pick the right ending ('Meow/Moo' in English; 'Miaou/Meuh' in French) given a premise ('I am a cat.' in English; 'Je suis un chat.' in French). $\texttt{DALI}=1$ if the similarity $\mathcal{S}$ of the representations of cross-lingual matched pairs exceeds that of the mismatched pairs, indicating that the model can distinguish parallel English and non-English contexts in its latent space. A stricter variant, $\texttt{DALI}_{S}$, adds the condition that the similarity of cross-lingual matched pairs must also exceed that of intra-lingual mismatched pairs.
  • Figure 2: Illustration of $\texttt{DALI}_{S}$'s failure on an Italian sample of XCOPA: the cosine similarity (CS) of matched pairs across languages (0.766, 0.770) exceeds that of mismatched pairs (0.762, 0.758), but the high similarity of within-language mismatched pairs (0.938, 0.984) drives $\texttt{DALI}_{S}=0$.
  • Figure 3: EC-XC (1042) vs EC-XW (195)
  • Figure 4: Instance-level analyses: $\Delta$ of $\texttt{MEXA}_{T}$, $\texttt{DALI}$, and $\texttt{DALI}_{S}$ (DALI.S) between EC-XC and EC-XW in $l_{max}$ for the Chinese split of Belebele (Bele), XStoryCloze (Story), and XCOPA.
  • Figure 5: Instance-level analysis (Alignment vs Accuracy): Illustration of z-test for proportions between EC-XC and EC-XW using $\texttt{DALI}_{S}$. In the layer with maximum $\texttt{DALI}_{S}$ overall, we calculate the $\Delta$ between EC-XC and EC-XW cohorts. $\Delta=-0.04$, in this example, illustrates that cross-lingual alignment is not associated with correct individual model decisions.
  • ...and 6 more figures
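
The DALI and $\texttt{DALI}_{S}$ conditions described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. This is one plausible reading of the captions, not the authors' implementation. It assumes that $\mathcal{S}$ is cosine similarity, that each answer option has one representation vector per language at a given layer, and that every matched pair must beat every mismatched pair that shares an option with it. The function names `cosine` and `dali` and the toy vectors are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dali(en: Sequence[Vector], xx: Sequence[Vector], strict: bool = False) -> int:
    """Return 1 if the sample counts as aligned under this sketch, else 0.

    en[i] and xx[i] are the (assumed) layer representations of
    premise + option i in English and in the other language.
    strict=True adds the DALI_S-style intra-lingual condition.
    """
    n = len(en)
    if n != len(xx) or n < 2:
        raise ValueError("need the same number (>= 2) of options per language")
    for i in range(n):
        matched = cosine(en[i], xx[i])  # cross-lingual matched pair
        for j in range(n):
            if j == i:
                continue
            # Cross-lingual mismatched pairs that involve option i.
            if matched <= cosine(en[i], xx[j]) or matched <= cosine(en[j], xx[i]):
                return 0
            # Stricter variant: also beat within-language mismatched pairs.
            if strict and (
                matched <= cosine(en[i], en[j]) or matched <= cosine(xx[i], xx[j])
            ):
                return 0
    return 1


# Toy case A: content dominates the representation, so both variants hold.
en_a = [[1.0, 0.0], [0.0, 1.0]]
xx_a = [[0.9, 0.1], [0.1, 0.9]]

# Toy case B: a language-specific component (dims 0 and 1) dominates and
# content (dim 2) is a small signal, loosely mirroring the Figure 2 pattern.
en_b = [[1.0, 0.0, 0.2], [1.0, 0.0, -0.2]]
xx_b = [[0.0, 1.0, 0.2], [0.0, 1.0, -0.2]]

print(dali(en_a, xx_a), dali(en_a, xx_a, strict=True))
print(dali(en_b, xx_b), dali(en_b, xx_b, strict=True))
```

In toy case B the matched cross-lingual pairs still beat the mismatched cross-lingual pairs, so the non-strict check passes, but the two English options are far more similar to each other than either is to its translation, so the strict check fails. This is the same qualitative pattern that the Figure 2 caption reports for the Italian XCOPA sample.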