Sinhala-English Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language
Kasun Wickramasinghe, Nisansa de Silva
TL;DR
This work tackles the problem of aligning Sinhala and English word embeddings, a challenge amplified by Sinhala's low-resource status. It presents a benchmark and new alignment datasets to enable supervised cross-lingual embedding alignment, and evaluates several alignment techniques (including Procrustes, CSLS, and RCSLS) using FastText embeddings. The study finds that cc-based (Common Crawl) Sinhala embeddings yield better alignment than wiki-based (Wikipedia) ones, and that larger, frequency-focused dictionaries improve precision, though Sinhala–English alignment still lags behind high-resource pairs. The authors discuss data quality and methodological choices and propose semi-supervised and unsupervised avenues for improving alignment in future work, aiming to provide a solid foundation for Sinhala in multilingual NLP tasks.
Abstract
Since their inception, embeddings have become a primary ingredient in many flavours of Natural Language Processing (NLP) tasks, supplanting earlier types of representation. Even though multilingual embeddings have been used for an increasing number of multilingual tasks, work on low-resource languages such as Sinhala tends to focus on monolingual embeddings because parallel training data is scarce. For multilingual tasks, these monolingual embeddings are difficult to use: even if the embedding spaces have a similar geometric arrangement due to an identical training process, the embeddings of the languages considered are not aligned. The embedding alignment task addresses this problem. Even here, high-resource language pairs are in the limelight, while low-resource languages such as Sinhala, which are in dire need of help, seem to have fallen by the wayside. In this paper, we try to align Sinhala and English word embedding spaces using available alignment techniques and introduce a benchmark for Sinhala language embedding alignment. In addition, to facilitate supervised alignment, we introduce Sinhala-English alignment datasets as an intermediate task. These datasets serve as our anchor datasets for supervised word embedding alignment. Even though we do not obtain results comparable to those for high-resource languages such as French, German, or Chinese, we believe our work lays the groundwork for more specialized alignment between English and Sinhala embeddings.
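To make the idea of supervised alignment with anchor pairs concrete, the following is a minimal sketch of orthogonal Procrustes alignment followed by nearest-neighbour retrieval. It is an illustration under stated assumptions, not the authors' implementation: the function names `procrustes_align` and `nearest_neighbour` are hypothetical, and the synthetic random vectors stand in for real Sinhala and English FastText embeddings paired through an anchor dictionary.

```python
import numpy as np


def procrustes_align(src, tgt):
    """Return the orthogonal matrix W minimising ||src @ W - tgt||_F.

    src and tgt are (n_pairs, dim) arrays whose rows are the source and
    target vectors of the anchor (dictionary) word pairs.
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def nearest_neighbour(queries, candidates):
    """Index of the most cosine-similar candidate for each query row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argmax(q @ c.T, axis=1)


# Synthetic stand-in data (assumption): the "source" space is an exactly
# rotated copy of the "target" space, so a perfect mapping exists.
rng = np.random.default_rng(0)
n_pairs, dim = 50, 8
tgt = rng.normal(size=(n_pairs, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src = tgt @ true_rotation.T

W = procrustes_align(src, tgt)
predicted = nearest_neighbour(src @ W, tgt)
precision_at_1 = float(np.mean(predicted == np.arange(n_pairs)))
```

With real embeddings the two spaces are only approximately related by a rotation, so precision falls well below the value this toy setup reaches. Replacing the plain cosine nearest-neighbour step with CSLS is one common way to reduce hubness during retrieval.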
