MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness
Shijia Zhou, Huangyan Shan, Barbara Plank, Robert Litschko
TL;DR
This work tackles cross-lingual semantic textual relatedness (STR) without target-language supervision for 12 target languages in SemEval-2024 Task 1, Track C. It systematically compares source-language selection strategies using XLM-R and Furina, including single-source transfer, multi-source transfer, kNN-based augmentation, language-family transfer, transliteration, and MT-based data augmentation. The key findings are that multi-source transfer generally improves STR performance, though the benefit depends on data size and language proximity, while script differences and MT-related label shifts can complicate transfer. Furina fine-tuned on eng+esp+hau achieved first place on kin. The study offers practical guidance on donor-language selection for low-resource languages and highlights the trade-off between breadth of training data and language similarity in cross-lingual transfer.
Abstract
This paper presents our system developed for SemEval-2024 Task 1: Semantic Textual Relatedness (STR), Track C: Cross-lingual. The task is to predict the semantic relatedness of two sentences in a given target language without access to direct supervision (i.e., zero-shot cross-lingual transfer). To this end, we focus on different source language selection strategies applied to two pre-trained language models: XLM-R and Furina. We experiment with 1) single-source transfer, where source languages are selected based on typological similarity, 2) augmenting English training data with the two nearest-neighbor source languages, and 3) multi-source transfer, where we compare training on all available source languages against training only on languages from the same family. We further study machine translation-based data augmentation and the impact of script differences. Our submission achieved first place on the C8 (Kinyarwanda) test set.
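As a rough illustration of strategy 2, the sketch below ranks candidate source languages by cosine similarity between per-language feature vectors and adds the two nearest neighbors to the English training pool. It is a minimal sketch under stated assumptions, not the system described here: the function names, the use of cosine similarity, and the language codes `tgt`, `src_a`, `src_b`, `src_c` are hypothetical, and the feature vectors are made-up placeholders rather than real typological data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_sources(target, candidates, features, k=2):
    """Return the k candidate languages most similar to the target."""
    ranked = sorted(
        candidates,
        key=lambda lang: cosine(features[target], features[lang]),
        reverse=True,
    )
    return ranked[:k]

def build_training_pool(target, candidates, features, k=2, base="eng"):
    """Start from the base language and add the k nearest neighbors."""
    others = [lang for lang in candidates if lang != base]
    return [base] + nearest_sources(target, others, features, k)

# Placeholder vectors for illustration only (assumption: NOT real
# typological features of any language).
TOY_FEATURES = {
    "tgt":   [1, 0, 1, 1, 0, 1],
    "eng":   [0, 1, 0, 1, 1, 0],
    "src_a": [1, 0, 1, 1, 0, 0],
    "src_b": [1, 0, 0, 1, 1, 1],
    "src_c": [0, 1, 0, 0, 1, 0],
}

pool = build_training_pool(
    "tgt", ["eng", "src_a", "src_b", "src_c"], TOY_FEATURES
)
print(pool)
```

With these placeholder vectors, `src_a` and `src_b` are closest to `tgt`, so the pool becomes English plus those two. In practice the vectors would come from a typological resource, and the choice of similarity measure is itself a design decision.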
