mOthello: When Do Cross-Lingual Representation Alignment and Cross-Lingual Transfer Emerge in Multilingual Models?

Tianze Hua, Tian Yun, Ellie Pavlick

TL;DR

Multilingual Othello (mOthello) provides a controlled setting to disentangle language-neutral representation learning from cross-lingual transfer. The authors show that naive multilingual pretraining fails to align representations across languages, and that anchor tokens improve alignment; however, transfer does not follow from alignment alone. A unified output-space pretraining approach achieves both alignment and cross-lingual transfer, even across more than two languages. These results challenge the notion that representation alignment is sufficient for transfer and point to training objectives that enforce a shared language-neutral output space as a practical path for multilingual generalization.

Abstract

Many pretrained multilingual models exhibit cross-lingual transfer ability, which is often attributed to a learned language-neutral representation during pretraining. However, it remains unclear what factors contribute to the learning of a language-neutral representation, and whether the learned language-neutral representation suffices to facilitate cross-lingual transfer. We propose a synthetic task, Multilingual Othello (mOthello), as a testbed to delve into these two questions. We find that: (1) models trained with naive multilingual pretraining fail to learn a language-neutral representation across all input languages; (2) the introduction of "anchor tokens" (i.e., lexical items that are identical across languages) helps cross-lingual representation alignment; and (3) the learning of a language-neutral representation alone is not sufficient to facilitate cross-lingual transfer. Based on our findings, we propose a novel approach - multilingual pretraining with unified output space - that both induces the learning of language-neutral representation and facilitates cross-lingual transfer.

Paper Structure

This paper contains 33 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Illustration of three multilingual training approaches. Blue and green blocks represent contexts in two different languages, and tokens from the same language have the same color. A multilingual model M consumes a, b, c, d and predicts the corresponding output e. Top: A model is trained on multilingual corpora, with the objective of predicting the next tokens specific to each language. Middle: A model is trained on multilingual corpora in which some tokens, called anchor tokens, are shared across language pairs. The objective is still to predict the next tokens specific to each language. Bottom: A model is trained on multilingual corpora, with the objective of predicting the next tokens in a unified output space.
  • Figure 2: An illustration of mOthello. Left: We map game moves to language-specific tokens $t_j^k$ by using a function $f_k$ for language $\mathcal{L}_k$. Right: We create a multilingual Othello corpus by mapping Othello game sequences to language-specific sequences.
  • Figure 3: An illustration of the probe training procedure and the cross-lingual alignment probing set-up. Left: We train a probe $\mathcal{P}_1^l$ on the activations at layer $l$ of an mOthelloGPT, using only input sequences in language $\mathcal{L}_1$. The ground-truth labels are obtained by interacting with the Othello environment. Right: After probe $\mathcal{P}_1^l$ is trained, we use it to recover the board state from activations at layer $l$ of the same mOthelloGPT model, but with input sequences from another language $\mathcal{L}_2$.
  • Figure 4: Pairwise cross-lingual alignment probe accuracy for mOthelloGPT trained on 20 atomic languages with naive multilingual pretraining. Each cell $c_{(i,j)}$ reflects the cross-lingual alignment probe accuracy from language $\mathcal{L}_i$ to $\mathcal{L}_j$. For instance, cell $c_{(0,1)}$ shows that a probe trained on language $\mathcal{L}_0$ predicts the board state from input sequences in language $\mathcal{L}_1$ with accuracy 0.52. We observe clusters of languages whose representations are aligned with each other, while the alignment of representations across clusters is poor.
  • Figure 5: Cross-lingual transfer performance of mOthelloGPTs trained on different pairs of languages under the naive, anchor-token, and unified output space training approaches. Columns (left to right): 1) with 0 anchor tokens, poor language-neutral representations are learned, as indicated by the low cross-lingual alignment probe accuracy; 2) with 8 anchor tokens, rich language-neutral representations are learned in all language pairs, yet cross-lingual transfer is poor, as indicated by the decline in target-language performance; and 3) with the unified output space approach used for both training and fine-tuning, representations are well aligned in all language pairs and cross-lingual transfer is also observed, as indicated by the improvement in target-language performance.
  • ...and 3 more figures
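
The move-to-token mapping in Figure 2 and the three training set-ups in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the token naming scheme (`L0_f5`, `anchor_f5`), the helper names `make_language`, `encode`, and `next_token_pairs`, and the choice to use the raw move as the unified-output target are all hypothetical. The example move sequence is illustrative only, and its legality is not checked.

```python
# 60 playable squares: an 8x8 board minus the four centre squares,
# which are already occupied before the first move.
COLS = "abcdefgh"
CENTRE = {"d4", "d5", "e4", "e5"}
MOVES = [f"{c}{r}" for c in COLS for r in range(1, 9) if f"{c}{r}" not in CENTRE]


def make_language(k, anchors=frozenset()):
    """Hypothetical f_k: map every move to a token of language k.

    Moves listed in `anchors` get one shared token in every language
    (the "anchor tokens" set-up); all other moves get a token that is
    unique to language k.
    """
    return {m: (f"anchor_{m}" if m in anchors else f"L{k}_{m}") for m in MOVES}


def encode(game, f_k):
    """Translate a sequence of moves into a language-specific token sequence."""
    return [f_k[m] for m in game]


def next_token_pairs(game, f_k, unified=False):
    """Build (context, target) pairs for next-move prediction.

    unified=False: the target is the language-specific token
                   (naive and anchor-token set-ups).
    unified=True:  the target is the language-neutral move itself
                   (an assumed stand-in for a unified output space).
    """
    tokens = encode(game, f_k)
    targets = game if unified else tokens
    return [(tokens[:i], targets[i]) for i in range(1, len(game))]


# Illustrative move sequence; legality is not checked here.
game = ["f5", "f6", "e6", "f4"]
f0 = make_language(0, anchors={"f5"})
f1 = make_language(1, anchors={"f5"})

print(encode(game, f0))
print(encode(game, f1))
print(next_token_pairs(game, f1, unified=True)[0])
```

In this sketch the two languages agree only on the anchor token for `f5`, and with `unified=True` both languages share the same prediction targets while keeping language-specific inputs.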