Beyond Literal Token Overlap: Token Alignability for Multilinguality

Katharina Hämmerl, Tomasz Limisiewicz, Jindřich Libovický, Alexander Fraser

TL;DR

Literal token overlap explains cross-lingual transfer poorly, especially for language pairs with different scripts. The paper addresses this limitation by introducing token alignability, a subword-level metric derived from statistical word alignments. It defines two directional alignability scores and a symmetrised one using eflomal, trains priors on large parallel data, and evaluates how well the scores correlate with downstream transfer and cross-lingual embedding alignment across encoder and decoder models. For encoder models, eflomal-based alignability predicts transfer better than distributional JSD, particularly for different-script pairs, and also correlates with embedding alignment; results for decoder models are mixed and depend on the model. The findings suggest that token alignability can guide multilingual tokeniser design and language-pair selection for cross-lingual transfer. Code and reproducibility details are published.

Abstract

Previous work has considered token overlap, or even similarity of token distributions, as predictors for multilinguality and cross-lingual knowledge transfer in language models. However, these very literal metrics assign large distances to language pairs with different scripts, which can nevertheless show good cross-linguality. This limits the explanatory strength of token overlap for knowledge transfer between language pairs that use distinct scripts or follow different orthographic conventions. In this paper, we propose subword token alignability as a new way to understand the impact and quality of multilingual tokenisation. In particular, this metric predicts multilinguality much better when scripts are disparate and the overlap of literal tokens is low. We analyse this metric in the context of both encoder and decoder models, look at data size as a potential distractor, and discuss how this insight may be applied to multilingual tokenisation in future work. We recommend our subword token alignability metric for identifying optimal language pairs for cross-lingual transfer, as well as to guide the construction of better multilingual tokenisers in the future. We publish our code and reproducibility details.
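To make the "similarity of token distributions" baseline concrete, the following is a minimal sketch that assumes the baseline is a Jensen-Shannon divergence (JSD) between the unigram subword frequencies of two languages. The function names and the toy token lists are illustrative assumptions, not the authors' code.

```python
import math
from collections import Counter


def token_distribution(tokens):
    """Relative frequency of each subword token in a tokenised corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def jensen_shannon_divergence(p, q):
    """JSD (base 2) between two token distributions given as dicts.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with disjoint support.
    """
    jsd = 0.0
    for tok in set(p) | set(q):
        pi = p.get(tok, 0.0)
        qi = q.get(tok, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution M = (P + Q) / 2
        if pi > 0:
            jsd += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * math.log2(qi / mi)
    return jsd


# Hypothetical toy corpora: no literal token is shared across the scripts.
en_tokens = ["▁the", "▁cat", "▁sat", "▁the"]
el_tokens = ["▁η", "▁γάτα", "▁κάθισε", "▁η"]

p_en = token_distribution(en_tokens)
p_el = token_distribution(el_tokens)
divergence = jensen_shannon_divergence(p_en, p_el)
```

With no shared tokens the divergence sits at its maximum regardless of how translatable the two corpora are, which is the behaviour the abstract criticises for different-script pairs.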

Paper Structure

This paper contains 22 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Eflomal score (bottom), a measure of token alignability, predicts downstream transfer performance better than the previous metric of distributional token overlap (top). The difference is especially stark for language pairs with different scripts ($\bullet$), compared to language pairs with the same script ($\times$). The orange line shows the linear fit across all included pairs.
  • Figure 2: Unigram model: The eflomal score generally correlates better with downstream transfer than JSD. NER is the exception. Proportion of 1-1 token alignments, while it also breaks up the cluster of different-script language pairs, shows weaker or no correlations.
  • Figure 3: BPE model: The eflomal score correlates better with downstream transfer than JSD, with the exception of XNLI. Proportion of 1-1 token alignments, while it also breaks up the cluster of different-script language pairs, shows weaker or no correlations.
  • Figure 4: TokMix model: The eflomal score correlates better with downstream transfer than JSD, again with the exception of XNLI. Proportion of 1-1 token alignments, while it also breaks up the cluster of different-script language pairs, shows no correlations.
  • Figure 5: BPE Model: Eflomal scores correlate well with cross-lingual embedding alignment. Nevertheless, both metrics perform similarly over the Tatoeba dataset.
  • ...and 3 more figures
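
The captions for Figures 2 to 4 mention the proportion of 1-1 token alignments without defining it here. As a rough sketch only, the snippet below assumes that alignments are given as whitespace-separated `i-j` index pairs per sentence pair, and that a link counts as 1-1 when neither its source token nor its target token takes part in any other link. Both the input format and this definition are assumptions for illustration; the paper may compute the quantity differently.

```python
from collections import Counter


def one_to_one_proportion(alignment_line):
    """Share of alignment links whose source and target tokens are each
    aligned exactly once (assumed definition of a 1-1 alignment)."""
    links = [tuple(map(int, pair.split("-"))) for pair in alignment_line.split()]
    if not links:
        return 0.0
    src_degree = Counter(i for i, _ in links)
    tgt_degree = Counter(j for _, j in links)
    one_to_one = sum(
        1 for i, j in links if src_degree[i] == 1 and tgt_degree[j] == 1
    )
    return one_to_one / len(links)


# Hypothetical alignment: source token 1 is linked to two target tokens,
# so only the links 0-0 and 2-3 count as 1-1.
proportion = one_to_one_proportion("0-0 1-1 1-2 2-3")
```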