A Survey of Cross-lingual Word Embedding Models
Sebastian Ruder, Ivan Vulić, Anders Søgaard
TL;DR
This survey organizes cross-lingual word embedding models in a unifying typology based on the data signal and supervision they require: word-, sentence-, or document-level alignment, drawn from parallel or comparable data. It shows that many approaches optimize essentially the same objectives and differ mainly in data and optimization strategy, and it presents mapping-based, pseudo-bilingual, and joint learning methods as closely connected paradigms. The authors provide historical context, discuss evaluation frameworks and benchmarks, and describe how bilingual models extend to the multilingual setting, including pivot-language strategies. They also outline practical challenges and future directions, such as subword information, multi-word expressions, polysemy, and robust unsupervised methods, and they emphasize data quality and compatibility over architectural novelty. Overall, the work offers a comprehensive, standardized view of cross-lingual embeddings and points future research toward data-centric improvements and multilingual scalability.
Abstract
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.
