Unsupervised Bilingual Lexicon Induction for Low Resource Languages
Charitha Rathnayake, P. R. S. Thilakarathna, Uthpala Nethmini, Rishemjith Kaur, Surangika Ranathunga
TL;DR
This paper tackles the scarcity of bilingual lexicons for low-resource languages by evaluating unsupervised, structure-based bilingual lexicon induction (BLI) with VecMap (UVecMap) and systematically combining multiple improvements. It extends the baseline with dimensionality reduction, embedding creation strategies, embedding pre-processing, and initialization techniques, and merges static and contextual embeddings through CSCBLI to build a unified cross-lingual space enhanced by a spring-network offset and contrastive training. Experiments on English–Sinhala (EnSi), English–Tamil (EnTa), and English–Punjabi (EnPa) identify one combination, CSCBLI with Linear Transformation and UVecMap, that consistently improves precision@1 over the baseline, while also revealing language-specific sensitivities (e.g., potential mapping failures with EnTa FastText embeddings). The authors release new human-curated dictionaries for EnSi and EnPa and discuss practical limitations (GPU memory, manual hyperparameter tuning) along with directions for scalable automation and broader language coverage, underscoring the approach's potential to improve downstream NLP tasks for low-resource languages.
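Since the results are reported as precision@1, a minimal sketch of that metric may help: it is the fraction of test source words whose top-ranked predicted translation appears among the gold translations. The function name `precision_at_1` and the dictionary-based data layout are assumptions made for illustration, not the authors' evaluation code.

```python
def precision_at_1(predicted, gold):
    """Fraction of gold source words whose top-1 prediction is a gold translation.

    predicted: dict mapping a source word to its single top-ranked translation.
    gold:      dict mapping a source word to a set of acceptable translations.
    Both layouts are illustrative assumptions.
    """
    if not gold:
        raise ValueError("gold dictionary must not be empty")
    # A source word with no prediction counts as a miss.
    hits = sum(1 for word, refs in gold.items() if predicted.get(word) in refs)
    return hits / len(gold)
```

A source word with several acceptable translations counts as correct if the top prediction matches any one of them.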
Abstract
Bilingual lexicons play a crucial role in various Natural Language Processing tasks. However, many low-resource languages (LRLs) lack such lexicons and, for the same reason, cannot benefit from supervised Bilingual Lexicon Induction (BLI) techniques. To address this, unsupervised BLI (UBLI) techniques were introduced. A prominent technique in this line is structure-based UBLI, an iterative method in which a seed lexicon, initially learned from monolingual embeddings, is iteratively improved. There have been numerous improvements to this core idea; however, they have been evaluated independently of each other. In this paper, we investigate whether using these techniques simultaneously would lead to equal gains. We use the unsupervised version of VecMap, a commonly used structure-based UBLI framework, and carry out a comprehensive set of experiments on the LRL pairs English-Sinhala, English-Tamil, and English-Punjabi. These experiments helped us identify the best combination of the extensions. We also release bilingual dictionaries for English-Sinhala and English-Punjabi.
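The iterative refinement described above can be illustrated with a simplified self-learning loop: fit an orthogonal map on the current lexicon, re-induce the lexicon by nearest-neighbour search in the mapped space, and repeat. This is a sketch under stated assumptions, not VecMap's actual implementation. It takes the seed lexicon as an input rather than deriving it from the monolingual embeddings, uses plain cosine nearest neighbours, and omits the additional steps a full framework applies. All function names here are hypothetical.

```python
import numpy as np

def normalize(emb):
    """Length-normalize each row so dot products equal cosine similarities."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def procrustes(src, tgt):
    """Orthogonal map W minimizing ||src @ W - tgt||_F (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def induce(src, tgt, w):
    """For each source row, return the index of the most similar target row."""
    return np.argmax((src @ w) @ tgt.T, axis=1)

def self_learning(src, tgt, seed_pairs, n_iter=10):
    """Alternate between fitting the map and re-inducing the lexicon.

    src, tgt:   monolingual embedding matrices with the same dimensionality.
    seed_pairs: (source_indices, target_indices) of the initial lexicon
                (assumed given here; an unsupervised system would build it
                from the embeddings themselves).
    """
    src, tgt = normalize(src), normalize(tgt)
    src_idx, tgt_idx = seed_pairs
    for _ in range(n_iter):
        w = procrustes(src[src_idx], tgt[tgt_idx])   # fit map on current lexicon
        new_tgt_idx = induce(src, tgt, w)            # re-induce over all source words
        src_idx = np.arange(len(src))
        if np.array_equal(new_tgt_idx, tgt_idx):     # lexicon stopped changing
            break
        tgt_idx = new_tgt_idx
    return w, tgt_idx
```

The returned `tgt_idx[i]` is the induced translation index for source word `i`. Constraining the map to be orthogonal preserves distances within the source space, which is a common design choice in structure-based methods.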
