An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks
Wazir Ali, Saifullah Tumrani, Jay Kumar, Tariq Rahim Soomro
TL;DR
This work tackles the scarcity of Sindhi language resources by assembling a large unlabeled corpus (more than 61 million words) from diverse web sources and training three embedding models on it: GloVe, continuous bag-of-words (CBoW), and skip-gram (SG). The authors clean the crawled text with a purpose-built preprocessing pipeline and a curated Sindhi stop-word list, then evaluate the resulting embeddings on both intrinsic and extrinsic tasks. CBoW and SG consistently outperform GloVe and the existing Sindhi fastText embeddings. Intrinsic assessments (nearest neighbors, word-pair relationships, WordSim-347) and extrinsic tasks (POS tagging and NER on SiPOS/SiNER) show that SG-based embeddings capture stronger semantic and syntactic relationships and yield better downstream performance. The study contributes a sizable Sindhi corpus, a stop-word list, and a comparative analysis of embedding methods, with implications for improved Sindhi NLP applications and possible extensions to WordNet construction and contextualized models such as BERT, ELMo, and GPT.
Abstract
In this paper, we present a new corpus for training word embeddings, consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline to filter unwanted text from the crawled data. The cleaned vocabulary is then fed to the state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. To evaluate the pretrained embeddings, we use popular intrinsic and extrinsic evaluation approaches. The results reveal that continuous-bag-of-words and skip-gram perform better than GloVe and the existing Sindhi fastText word embeddings on both intrinsic and extrinsic evaluations.
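The intrinsic evaluations named above (nearest neighbors and word-similarity correlation) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the three-dimensional toy vectors, the English placeholder words, and the gold similarity scores are all hypothetical and chosen only to make the example self-contained. It assumes cosine similarity for neighbor ranking and Spearman rank correlation for comparing model scores with human judgments, which are the usual choices for this kind of benchmark.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbors(word, vectors, k=3):
    """Return the k words most similar to `word`, best first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def evaluate_similarity(pairs, vectors):
    """Correlate model similarities with gold scores, skipping OOV pairs."""
    gold, predicted = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            gold.append(score)
            predicted.append(cosine(vectors[w1], vectors[w2]))
    return spearman(gold, predicted)

# Hypothetical toy data; real embeddings would be loaded from a trained model.
TOY_VECTORS = {
    "king": [0.9, 0.1, 0.0],
    "queen": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.9, 0.4],
    "mango": [0.1, 0.8, 0.5],
}
TOY_PAIRS = [
    ("king", "queen", 9.0),
    ("apple", "mango", 8.5),
    ("king", "apple", 1.0),
    ("queen", "mango", 2.0),
]

if __name__ == "__main__":
    print(nearest_neighbors("king", TOY_VECTORS, k=2))
    print(evaluate_similarity(TOY_PAIRS, TOY_VECTORS))
```

In a setup like this, a higher Spearman correlation means the embedding's similarity ranking agrees more closely with the human-rated word pairs. Pairs containing out-of-vocabulary words are skipped here, which is one common convention; other conventions would change the reported score.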
