Multilinguality Does not Make Sense: Investigating Factors Behind Zero-Shot Transfer in Sense-Aware Tasks

Roksana Goworek, Haim Dubossarsky

TL;DR

The study interrogates whether multilingual exposure genuinely boosts zero-shot transfer for sense-aware NLP tasks, focusing on polysemy disambiguation (WiC) and lexical semantic change detection (LSCD) across 28 languages. By employing a controlled, large-scale framework with fixed fine-tuning sizes and held-out languages across multiple multilingual base models, it isolates the effect of multilingual fine-tuning. The findings indicate that multilinguality is neither necessary nor consistently beneficial; pretraining size and data quality primarily drive transfer, with language similarity playing a secondary role. The work highlights data-centric factors as key determinants of cross-lingual performance and offers fine-tuned baselines to inform future research on transfer, especially for low-resource languages.

Abstract

Cross-lingual transfer is central to modern NLP, enabling models to perform tasks in languages different from those they were trained on. A common assumption is that training on more languages improves zero-shot transfer. We test this on sense-aware tasks (polysemy and lexical semantic change) and find that multilinguality is not necessary for effective transfer. Our large-scale analysis across 28 languages reveals that other factors, such as differences in pretraining and fine-tuning data and evaluation artifacts, better explain the perceived benefits of multilinguality. We also release fine-tuned models and provide empirical baselines to support future research. While focused on two sense-aware tasks, our findings offer broader insights into cross-lingual transfer, especially for low-resource languages.

Paper Structure

This paper contains 47 sections, 1 equation, 9 figures, 29 tables.

Figures (9)

  • Figure 1: Mean accuracies and standard deviations (bars) for multilingual and monolingual models on WiC datasets using different pretrained models. Colors indicate whether fine-tuning was done on all data or on a sampled portion of it (Hindi and Chinese appear only in the former because their datasets were too small to subsample). Hindi, as a single-language dataset, has no standard deviation. For detailed results see Appendix \ref{app:all_results}.
  • Figure 2: Mean correlations between languages' pretraining sizes and the zero-shot performance of mono models.
  • Figure 3: MuRIL accuracy scores. "(not in PT)" are languages absent from MuRIL’s pretraining.
  • Figure 4: Proportions of languages in WiC datasets used for training. multi is trained on all of them except Hindi.
  • Figure 5: Mean correlations between syntactic similarity (between the fine-tuning language of mono models and the target language) and zero-shot performance on the target languages. Superimposed on the correlations reported in Figure \ref{fig:pt-corr} for comparison with the correlations with pretraining sizes.
  • ...and 4 more figures