Assessing the Role of Lexical Semantics in Cross-lingual Transfer through Controlled Manipulations
Roy Ilani, Taelin Karidi, Omri Abend
TL;DR
The paper investigates which linguistic properties most influence cross-lingual transfer. It constructs artificial English variants that each emulate a single trait of a target language, then measures how well each variant can be aligned to an English pretrained representation space via knowledge distillation. Three properties are examined: script, word order, and lexical semantics. The degree of lexical matching, quantified through translation entropy in a bipartite lexicon graph, largely drives zero-shot transfer performance, while script and word order have limited impact. Using an intrinsic cosine-based alignment measure and an extrinsic XNLI evaluation, the study shows that swapping high-entropy lexical items substantially degrades transfer, which underscores the importance of one-to-one lexicon alignments between languages. These results suggest that future cross-lingual systems should prioritize aligning lexical semantics across languages to improve zero-shot capabilities, particularly for low-resource languages. The methodology also provides a principled framework for quantifying lexicalization differences and their effects on representation transfer.
Abstract
While cross-linguistic model transfer is effective in many settings, there is still limited understanding of the conditions under which it works. In this paper, we focus on assessing the role of lexical semantics in cross-lingual transfer, comparing its impact to that of other language properties. Examining each language property individually, we systematically analyze how differences between English and a target language influence the capacity to align the language with an English pretrained representation space. We do so by artificially manipulating the English sentences in ways that mimic specific characteristics of the target language, and reporting the effect of each manipulation on the quality of alignment with the representation space. We show that while properties such as the script or word order have only a limited impact on alignment quality, the degree of lexical matching between the two languages, which we define using a measure of translation entropy, greatly affects it.
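
To make the notion of translation entropy concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each English word has a set of weighted edges to target-language words in a bipartite lexicon graph, and that the measure is the Shannon entropy of the normalized edge weights; the exact definition used in the paper is not given in this section. The function name `translation_entropy` and the toy counts are hypothetical.

```python
import math
from collections import Counter


def translation_entropy(translation_counts):
    """Shannon entropy (in bits) of one word's translation distribution.

    `translation_counts` maps each target-language translation to a
    non-negative weight (e.g. an alignment count). This is an assumed
    formulation, not necessarily the paper's exact definition.
    """
    total = sum(translation_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in translation_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Toy, made-up edge weights for two English words (illustration only).
lexicon = {
    "dog": Counter({"perro": 10}),                # one-to-one mapping
    "know": Counter({"saber": 5, "conocer": 5}),  # one-to-many mapping
}

entropies = {word: translation_entropy(c) for word, c in lexicon.items()}
for word, h in entropies.items():
    print(f"{word}: {h:.2f} bits")
```

Under this formulation, a word with a single translation has entropy 0, and a word whose weight is split evenly over two translations has entropy 1 bit, so higher values indicate a less one-to-one mapping between the two lexicons.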
