Identifying the Correlation Between Language Distance and Cross-Lingual Transfer in a Multilingual Representation Space

Fred Philippy, Siwen Guo, Shohreh Haddadan

TL;DR

The paper studies how linguistic distance shapes cross-lingual transfer by examining the absolute evolution of language representation spaces in multilingual language models (MLLMs) during fine-tuning. It introduces a layer-wise, CKA-based framework that quantifies the impact of fine-tuning in a source language $S$ on the representation space of a target language $T$ at layer $i$ via the metric $\Phi^{(i)}(S,T) = 1 - \mathrm{CKA}(H_T^i, H_{S\rightarrow T}^i)$, and it uses URIEL-based distance metrics from lang2vec to correlate these changes with linguistic distance. The study finds that genetic distance correlates with representation-space impact across all layers, while syntactic and geographic distances show stronger correlations in deeper layers. Cross-lingual transfer performance also correlates with representation-space impact, especially in deeper layers. The paper further provides preliminary evidence that selectively freezing layers (those with strong negative correlations between impact and distance) can reduce the transfer gap to linguistically distant languages in the zero-shot setting. Overall, the work highlights an interconnected triad of language distance, representation-space evolution, and transfer performance, and it points toward targeted interventions for improving cross-lingual transfer in multilingual models.
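To make the metric $\Phi^{(i)}(S,T)$ concrete, here is a minimal sketch. It assumes the linear variant of CKA (this page does not say which kernel the paper uses) and reads $H_T^i$ as the layer-$i$ activations for target-language text before fine-tuning and $H_{S\rightarrow T}^i$ as the activations for the same text after fine-tuning on $S$. The function names and the random toy matrices are illustrative stand-ins, not the authors' code or data.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (n_examples, dim) activation matrices."""
    # Center each feature so the similarity ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def representation_impact(h_before: np.ndarray, h_after: np.ndarray) -> float:
    """Phi = 1 - CKA(H_T, H_{S->T}) for a single layer."""
    return 1.0 - linear_cka(h_before, h_after)


# Toy stand-ins for layer activations (assumption: 200 examples, 16 dims).
rng = np.random.default_rng(0)
h_t = rng.normal(size=(200, 16))

# An orthogonal transform of the space leaves linear CKA unchanged.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
h_rotated = h_t @ q

# Unrelated activations should register a large impact.
h_random = rng.normal(size=(200, 16))

phi_same = representation_impact(h_t, h_t)
phi_rot = representation_impact(h_t, h_rotated)
phi_rand = representation_impact(h_t, h_random)
```

An impact near 0 means the layer's representation space for the target language is essentially unchanged (up to rotation and scaling), while values near 1 indicate that fine-tuning reshaped it substantially.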

Abstract

Prior research has investigated the impact of various linguistic features on cross-lingual transfer performance. In this study, we investigate the manner in which this effect can be mapped onto the representation space. While past studies have focused on the impact on cross-lingual alignment in multilingual language models during fine-tuning, this study examines the absolute evolution of the respective language representation spaces produced by MLLMs. We place a specific emphasis on the role of linguistic characteristics and investigate their inter-correlation with the impact on representation spaces and cross-lingual transfer performance. Additionally, this paper provides preliminary evidence of how these findings can be leveraged to enhance transfer to linguistically distant languages.

Paper Structure

This paper contains 24 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Pearson correlation coefficients, per layer, between the impact on a target language's representation space when fine-tuning in a source language and different types of linguistic distance between the source and target languages. Data points where the source and target language are the same were excluded to avoid overestimating effects. ($^{*}p<0.05$, $^{**}p<0.01$, two-tailed).
  • Figure 2: Pearson correlation coefficients between the impact on the representation space and different types of linguistic distance, with English as the only source language. ($^{*}p<0.05$, $^{**}p<0.01$, two-tailed).
  • Figure 3: Cross-lingual zero-shot transfer results for XNLI.
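The layer-wise analysis described in the captions of Figures 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function name `layerwise_pearson`, the array layout, and all numbers are hypothetical and synthetic. Significance testing is left out; a routine such as `scipy.stats.pearsonr` would return the p-value alongside the coefficient.

```python
import numpy as np


def layerwise_pearson(impact: np.ndarray,
                      distance: np.ndarray,
                      same_pair: np.ndarray) -> np.ndarray:
    """Pearson r between impact and distance, one value per layer.

    impact:    (n_layers, n_pairs) impact values, one column per (S, T) pair
    distance:  (n_pairs,) linguistic distance for each (S, T) pair
    same_pair: (n_pairs,) True where source and target language coincide
    """
    keep = ~same_pair  # drop S == T pairs, as in the Figure 1 caption
    d = distance[keep]
    return np.array([np.corrcoef(layer[keep], d)[0, 1] for layer in impact])


# Synthetic example: six language pairs, two of them with S == T.
distance = np.array([0.0, 0.3, 0.6, 0.9, 0.0, 0.5])
same_pair = np.array([True, False, False, False, True, False])

# Layer 0 impact falls with distance, layer 1 impact rises with it.
impact = np.stack([0.5 - 0.4 * distance,
                   0.1 + 0.2 * distance])

r = layerwise_pearson(impact, distance, same_pair)
```

Excluding same-language pairs matters because their distance is zero by construction, so leaving them in could inflate the apparent strength of the correlation.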